Abstract

Neuroscientists have recently shown that images that are difficult to find in visual search elicit similar patterns of firing
across a population of recorded neurons. The $L^1$ distance between firing rate vectors associated with two images was strongly
correlated with the inverse of decision time in behaviour. But why should decision times be correlated with $L^1$ distance? What
is the decision-theoretic basis? In our decision-theoretic formulation, we model visual search as an active sequential hypothesis
testing problem with switching costs. Our analysis suggests an appropriate neuronal dissimilarity index which correlates equally
strongly with the inverse of decision time as the $L^1$ distance. We also consider a number of other possibilities such as the relative
entropy (Kullback-Leibler divergence) and the Chernoff entropy of the firing rate distributions. A more stringent test of equality
of means, which would have provided a strong backing for our modeling, fails for our proposed index as well as for the other
dissimilarity indices discussed. However, the test statistics from the equality of means test, when used to rank the indices in terms
of their ability to explain the observed results, place our proposed dissimilarity index at the top, followed by relative entropy,
Chernoff entropy and the $L^1$ indices. Computation of the different indices requires an estimate of the relative entropy between
two Poisson point processes. An estimator is developed and is shown to have near unbiased performance for almost all operating
regions.


I. Introduction 


We invite the reader to participate in the following visual search tasks. There are two search tasks on page 26. Find the
oddball image in each of the two configurations. Based on the time taken for each of the tasks, identify which of the two is
easier.


Among the two search tasks on page 26, most subjects find Task 1 the easier and Task 2 the tougher. Visual search
performance, as measured by the time taken to find the oddball image, should depend on the “similarity” of the two images. 
One has the natural hypothesis: 

(H) The more “dissimilar” the two images, the shorter the time taken to find the oddball image.

To test such a hypothesis, one needs a quantification of the notion of “dissimilarity” between two images. Sripati and Olson
[1] proposed one such measure based on neuronal responses (to the images) in the inferotemporal (IT) cortex of the macaque
brain. They conducted experiments to 1) find the time taken by human subjects in visual search for a number of image pairs,
and 2) record neuronal responses to the same images from the monkey IT cortex. They found quantitative evidence in support
of (H) based on their notion of dissimilarity. We now describe their experiments and recall their findings to set the stage for
this paper.

The experiments of Sripati and Olson [1] were the following.

1) Six human subjects were shown a picture as in the figure on page 26. Six images were placed at the vertices of a regular
hexagon, with one image being different from the others. To be specific, let $I_k$ and $I_l$ be two images. One of these two was
picked randomly with equal probability and was placed at one of the six locations randomly, again with equal probability.
The other image was placed in the remaining five locations. The subjects were required to identify the correct half (left or
right) of the plane where the oddball image was located. The subjects were advised to indicate their decision “as quickly
as possible without guessing” [1]. The time taken to make a decision* after the onset of the image was recorded. This
experiment was repeated on the same subject and across subjects. The average reaction time across trials, denoted $s(k,l)$,
was recorded. Thus $s(k,l)$ is the estimate of the (symmetrised) decision time to distinguish between $I_k$ and $I_l$. Similar
estimates were obtained for several pairs of images.

2) For capturing neuronal responses to images, Sripati and Olson conducted a set of experiments on macaque monkeys. See
[1] for details. A single image $I_k$ (respectively, $I_l$) was displayed on the screen, and the neuronal firings elicited by $I_k$
(respectively, $I_l$) on a set of IT neurons were recorded across multiple sessions. The neuronal representation of the image $I_k$
was taken to be the vector of average firing rates indexed by the neurons. This is denoted $R_k = (R_k(1), R_k(2), \ldots, R_k(d))$,
where $d$ is the number of tapped neurons. Similarly, the neuronal representation of image $I_l$ was estimated and denoted


This work was supported in part by the Department of Science and Technology. The neuronal and behavioral data used in this study were collected by one
of the authors (S. P. Arun) while he was at the laboratory of Prof. Carl Olson, Carnegie Mellon University. The material in this paper was presented in part at
the 2012 IEEE International Symposium on Information Theory, Cambridge, MA, USA, July 2012, and in part at the 2015 Information Theory
and Applications Workshop, San Diego, CA, USA, February 2015.

*A baseline motor reaction time for each subject was also estimated in a separate experiment and subtracted to get an estimate of the time to make a
decision. See [1] for details.
Fig. 1. Scatter plot of the inverse reaction time $1/s(k,l)$ against the $L^1$ distance $\|R_k - R_l\|_1$ between firing rates. Sripati and Olson [1] observed a high correlation of 0.95 between the inverse of reaction time and their
proposed $L^1$ distance between the neuronal firing vectors.


as the vector $R_l$. The measure of dissimilarity between the two images $I_k$ and $I_l$ was then taken to be the $L^1$-distance
normalised by the number of neurons:

$$\|R_k - R_l\|_1 = \frac{1}{d} \sum_{m=1}^{d} |R_k(m) - R_l(m)|. \tag{1}$$

They obtained the scatter plot of $1/s(k,l)$ against $\|R_k - R_l\|_1$ shown in Fig. 1, where $(k,l)$ varied across image pairs, and
observed a remarkably high correlation ($r = 0.95$), thereby providing evidence in support of a quantitative version of
(H).
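As a hedged illustration of the computation behind equation (1) and the reported correlation, the sketch below evaluates the normalised $L^1$ distance for a few rate vectors and correlates it with inverse search times. The rate vectors, search times, and function names are hypothetical and chosen only for illustration; they are not data from [1].

```python
import math

def normalised_l1(rk, rl):
    """Normalised L1 distance of equation (1): (1/d) * sum_m |R_k(m) - R_l(m)|."""
    if len(rk) != len(rl) or not rk:
        raise ValueError("rate vectors must be non-empty and of equal length")
    return sum(abs(a - b) for a, b in zip(rk, rl)) / len(rk)

def pearson(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical firing-rate vectors (spikes/s) for three images and d = 3 neurons.
rates = {"A": [10.0, 4.0, 7.0], "B": [2.0, 9.0, 7.0], "C": [9.0, 5.0, 6.0]}
# Hypothetical average search times s(k, l) in seconds for each image pair.
search_time = {("A", "B"): 0.30, ("A", "C"): 1.50, ("B", "C"): 0.35}

pairs = sorted(search_time)
dist = [normalised_l1(rates[k], rates[l]) for k, l in pairs]
inv_time = [1.0 / search_time[p] for p in pairs]
r = pearson(inv_time, dist)  # correlation of 1/s(k,l) with the L1 distance
```

With real data, `rates` would hold the measured average firing rates and `search_time` the measured reaction times after subtracting the baseline motor time.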

For a detailed discussion of how neural activity in monkey visual cortex can be used to predict human search performance,
we refer the reader to [1], [2], [3], [4].

The experiments of Sripati and Olson [1] and Figure 1 suggest a natural question of interest to researchers in information
and decision theory. One does anticipate that $s(k,l)$ is negatively correlated with some notion of dissimilarity between $R_k$
and $R_l$, say $\mathrm{diff}(R_k, R_l)$. Figure 1 suggests

$$s(k,l) \cdot \|R_k - R_l\|_1 = \text{constant}. \tag{2}$$

However, we know of no decision-theoretic basis for $\mathrm{diff}(R_k, R_l)$ to be $\|R_k - R_l\|_1$. What is an appropriate $\mathrm{diff}(R_k, R_l)$?

Familiarity with Wald’s Sequential Probability Ratio Test [5] immediately suggests that a relation like (2) should arise, but
with perhaps relative entropy, or its variant, in place of $\|R_k - R_l\|_1$. A variant may be called for because of the possibility
of controlled actions. To see why, let us summarize the decision problem in the form of a question:

One of the six images is odd. What would the prefrontal cortex (the decision center of the brain) do if it got 
observations (firings of neurons) from the human analogue of the IT cortex, and could control the eye (to gaze at one 
of the six objects)? The goal is to minimise the time to decide the oddball image and its location, yet keep errors 
within desired limits. 

One can model this decision problem as a sequential hypothesis testing problem with control. Naghshvar and Javidi, earlier
in [6] and more recently in [7], call such a problem active sequential hypothesis testing (ASHT). ASHT suggests a natural
candidate that we shall propose for $\mathrm{diff}(R_k, R_l)$. There is however one important modeling issue that we wish to bring to the
attention of the reader. Figure 1 shows that the average reaction times in the experiments are between 250 ms and 1000 ms.
However, it is known that a switch in focus from one search location to another has a cost per switch that ranges from tens
of ms to sometimes even higher than 100 ms [8]. To account for this, we extend ASHT to a setting with switching costs, and
show that the $\mathrm{diff}(R_k, R_l)$ appropriate for the setting without switching costs works equally well with switching costs.
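To make the relative entropy candidate concrete, here is a minimal sketch that evaluates the standard closed form for the relative entropy between two vectors of independent homogeneous Poisson point processes observed over a window of length $T$, namely $T \sum_m [R_k(m) \log(R_k(m)/R_l(m)) - R_k(m) + R_l(m)]$. The function names and the window length are assumptions for illustration. This is the plain relative entropy, not the dissimilarity index proposed in this paper.

```python
import math

def poisson_relative_entropy(rk, rl, T=1.0):
    """Relative entropy D(P_k || P_l) between d independent homogeneous Poisson
    processes on [0, T] with rate vectors rk and rl (all rates strictly positive)."""
    if len(rk) != len(rl):
        raise ValueError("rate vectors must have equal length")
    return T * sum(a * math.log(a / b) - a + b for a, b in zip(rk, rl))

def wald_expected_samples(divergence, eps):
    """Rough SPRT sample count log(1/eps) / divergence for error tolerance eps."""
    return math.log(1.0 / eps) / divergence

# Hypothetical rates: relative entropy is asymmetric, unlike the L1 distance.
d_kl = poisson_relative_entropy([2.0], [1.0])   # 2*log(2) - 1
d_lk = poisson_relative_entropy([1.0], [2.0])   # 1 - log(2)
```

The asymmetry visible in `d_kl` and `d_lk` is one reason the table of correlations later in the paper is built from ordered image pairs.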

^This refers to relative entropy of the probability measure of a set of $d$ Poisson point processes with rate vector $R_k$ taken with respect to the probability
measure associated with rate vector $R_l$.

^This rapid eye movement is called a saccade. 
TABLE I
Correlation with Different Information Measures

Information Measure    Correlation (1/s(·,·) vs. discrimination index)    p-value
Proposed               0.94                                               5.2 × 10^{-12}
KL                     0.93                                               8.5 × 10^{-11}
Chernoff               0.94                                               7.8 × 10^{-12}
L^1                    0.94                                               6.1 × 10^{-12}


As with $L^1$ distance, so with our proposed $\mathrm{diff}(R_k, R_l)$, and indeed, with other natural dissimilarity indices like relative
entropy and Chernoff entropy. Table I indicates that all these dissimilarity measures have similar high correlation with the
behavioural index. Given that all these dissimilarity indices yield high correlation with the reaction times, does our proposed
diff candidate stand out in some way? It is certainly grounded in a decision-theoretic framework as we shall soon see. But is
there some experimental evidence in favour of our proposed diff candidate? We address this question as well and propose a
method to rank order the dissimilarity measures in their ability to explain the experimental data of Sripati and Olson [1].

Prior Work on the ASHT model: Chernoff [9] studied ASHT in the context of designing optimal experiments. His
performance criterion was the total cost of sampling, which is proportional to delay, plus a penalty for false detection. Chernoff
proposed a policy, the so-called Procedure A, and showed its asymptotic optimality as the cost of sampling went to zero.
Procedure A maintains a posterior distribution on the set of hypotheses and, at each instant, selects actions according to the
hypothesis with the highest posterior probability.

There has been a flurry of recent activity extending Chernoff’s work in other directions. In a series of works, Naghshvar and 
Javidi Q, 0, nni, CD, CD studied ASHT from a Bayesian cost minimization perspective. The total cost was the sum of 
decision delay and a penalty for false detection. They proposed policies, similar to Chernoff’s Procedure A, identified bounds
on the total cost, and established their proposed policies’ asymptotic optimality in the same asymptotic regime as Chernoff’s.
Nitinawarat et al. [13] studied active hypothesis testing in fixed sample size and in sequential settings. They also minimize
decision delay subject to a constraint on the conditional probability of false detection. When these conditional probabilities of
false detection are driven to zero, the resulting asymptotic regime is the same as Chernoff’s. In this asymptotic regime, they
obtained results similar to those of Chernoff but under milder assumptions. They also prove a stronger asymptotic result
based on the “risk associated with a decision”. Nitinawarat and Veeravalli [14] extended ASHT to Markovian observations
and non-uniform costs on actions. Recently, Cohen and Zhao [15] studied anomaly detection from an ASHT perspective. They
showed that, in their particular setting, a simple deterministic policy is asymptotically optimal. This is in contrast to random
policies advocated in the other works. Further, for their particular setting, they showed the asymptotic optimality of Chernoff’s
policy under milder assumptions. None of the above works consider switching costs associated with a change in action.

Our contribution: Broadly, our contribution is a reinterpretation of the experimental results of Sripati and Olson [1] from
a decision-theoretic standpoint. The following highlight some specific contributions. 

• We formulate the visual search problem as an ASHT problem with switching costs. We show that a modification of 
Chernoff’s Procedure A, one that we call Sluggish Procedure A, is asymptotically optimal even with switching costs. 
Further, we show that the growth rate of the total cost, as the probability of false detection is driven to zero, can be made 
arbitrarily close to that without switching costs. 

• We propose a neuronal dissimilarity index for the diff functional in lieu of the $L^1$ distance between the two vectors (Sripati
and Olson’s proposed dissimilarity index in [1]). Our proposed dissimilarity index is based on, but is not the same as, the
relative entropy between two Poisson point processes with the specified firing rate vectors.

• We test the goodness of this neuronal dissimilarity index with respect to $L^1$ by examining which comes closest to satisfying

$$s(k,l) \cdot \mathrm{diff}(R_k, R_l) = \text{constant}, \tag{3}$$

and which is farthest. We propose a comparison statistic based on the “equality of means” testing. We use three different 
equality of means tests to arrive at three different statistics. The first is the familiar ANOVA’s F-statistic. The second 
is natural too, and is the analogue of the F-statistic associated with the family of Gamma distributed random variables 
instead of Gaussians. The Gamma distribution, as we will later discuss, provides a better fit for the delay data. The third 

^Relative entropy and Chernoff entropy are possible candidates because of the following. Consider a simple hypothesis testing problem where exactly one
of two images is displayed and the problem is to identify which. The stopping version of the problem corresponds to Wald’s sequential hypothesis testing.
The expected stopping time to meet a certain error tolerance criterion $\epsilon$ is roughly $\log(1/\epsilon)/(\text{relative entropy})$ [5]. When the decision is to be made after a
fixed number of samples, where the number of samples is fixed upfront to meet a certain error tolerance criterion, the required number of samples is roughly
$\log(1/\epsilon)/(\text{Chernoff entropy})$.

^In Table I, correlation values are based on scatter plots arising from ordered pairs of images. This explains why the correlation value in the table (obtained
from 24 points in the scatter plot) is marginally different from the correlation indicated in Figure 1 (and obtained from 12 symmetrised points in the plot).

^They also consider the asymptotics where the number of hypotheses is large. This is not of direct relevance to our study. 
is similar to the second, but assumes a known shape parameter. All three methods’ rankings are consistent: our proposed
dissimilarity index comes out as the best, with relative entropy coming a close second, in answer to the question: Which
neural dissimilarity measure based on firing rates would be optimal from a decision-theoretic point of view? We must
however add a sobering note that all three equality of means tests reject, in a rather spectacular fashion, the null hypothesis
of equal means in (3) at any reasonable level of statistical significance. So we emphasise that the test statistics are merely
used to rank order the dissimilarity measures.

• Our estimation of the proposed neuronal dissimilarity index requires a near unbiased estimate of relative entropy as an
intermediate step. We suggest a procedure to arrive at a nearly unbiased estimate. This may be of independent value.

Organisation: The rest of the paper is organised as follows. Section II studies the ASHT problem with costs for switching
actions. Section III applies the results of Section II to the visual search problem. Section IV develops the proposed neuronal
dissimilarity index and discusses its performance through correlation studies and “equality of means” testing. Section V provides
some summarising conclusions. The proofs are relegated to appendices, and Appendix C details the technique used to get
a near unbiased estimate of relative entropy of one Poisson point process with respect to another.


II. The ASHT Abstraction

In this section, we describe our mathematical model for visual search and collect all the relevant theoretical results. The
development will be somewhat abstract. But we shall relate the model to visual search and shall apply the results to that setting
in Section III. The main contribution of this section is the asymptotic growth rate of cost. In Section IV we shall see how
this suggests an appropriate diff function for plugging into (3).


A. The ASHT Model 

1) The model description: Let us begin by setting up some notation. 

Let $H_i$, $i = 1, 2, \ldots, M$, denote the $M$ hypotheses of which exactly one, denoted $H$, holds true. In this section, we do not
assume a prior on the hypotheses. Let $\mathcal{A}$ be the set of all possible actions, which we take as finite: $|\mathcal{A}| = K < \infty$. Let $\mathcal{X}$
be the observation space. Let $(X_n)_{n \ge 1}$ and $(A_n)_{n \ge 1}$ denote the observation process and the control process respectively. We
write $X^n$ for $(X_1, \ldots, X_n)$ and similarly $A^n$ for $(A_1, \ldots, A_n)$. We also write $\mathcal{P}(\mathcal{A})$ for the set of probability distributions on
$\mathcal{A}$.

A policy $\pi$ is a sequence of action plans that at time $n$ looks at the history $(X^{n-1}, A^{n-1})$ and prescribes a composite action
that is either $(\text{stop}, \delta)$ or $(\text{continue}, \lambda)$ as explained next. If the composite action is $(\text{stop}, \delta)$, then the controller stops taking
further samples (or retires) and indicates $\delta$ as its decision on the hypothesis; $\delta \in \{1, 2, \ldots, M\}$. If the composite action is
$(\text{continue}, \lambda)$, the controller picks the next action $A_n$ according to the distribution $\lambda \in \mathcal{P}(\mathcal{A})$. Let $\tau(\pi)$ be the stopping time

$$\tau(\pi) := \inf\{n \ge 1 \mid A_n = (\text{stop}, \cdot)\}.$$

Consider a policy $\pi$. Conditioned on action $A_n$ and the true hypothesis $H$, we assume that $X_n$ is conditionally independent
of previous actions $A^{n-1} = (A_1, A_2, \ldots, A_{n-1})$, previous observations $X^{n-1} = (X_1, X_2, \ldots, X_{n-1})$, and the policy. Let
$q_i^a$ be the conditional probability density function, with respect to some reference measure $\mu$, of the observation $X_n$ under action
$a$ when $H = H_i$. Let $D(q_i^a \| q_j^a)$ denote the relative entropy between the conditional probability measures associated with the
observations under hypothesis $H_i$ and under hypothesis $H_j$, upon action $a$. Denote by $\mathrm{unif}(\mathcal{A})$ the uniform distribution on $\mathcal{A}$.
Let $q_i^\pi(x^n, a^n)$ be the probability density function of observations and actions $(x^n, a^n)$ till time $n$ under policy $\pi$, with respect
to the common reference measure $\mu^{\otimes n} \times \mathrm{unif}(\mathcal{A})^{\otimes n}$. Let $Z_i^\pi(n)$ denote the log-likelihood process of hypothesis $H_i$, i.e.,

$$Z_i^\pi(n) = \log q_i^\pi(X^n, A^n). \tag{4}$$


Going forward, for ease of notation, we drop the superscript $\pi$ while describing $q_i^\pi$, $Z_i^\pi$, and other variables, but their dependence
on the underlying policy should be kept in mind, and the policy under consideration will be clear from the context. Define
$Z(n) = (Z_1(n), Z_2(n), \ldots, Z_M(n))$. Let $Z_{ij}(n)$ denote the log-likelihood ratio (LLR) process of $H_i$ with respect to $H_j$, i.e.,

$$Z_{ij}(n) = Z_i(n) - Z_j(n) = \log \frac{q_i(X^n, A^n)}{q_j(X^n, A^n)} = \sum_{l=1}^{n} \log \frac{q_i^{A_l}(X_l)}{q_j^{A_l}(X_l)}.$$

Let $E_i$ denote the conditional expectation and let $P_i$ denote the conditional probability measure under $H = H_i$. (More formally,
these should be represented as $E_i^\pi$ and $P_i^\pi$. But as done above, we omit the superscript $\pi$.)
By an abuse of notation, we use the densities of the probability measures as the arguments of the relative entropy function.
Given an error tolerance vector $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_M)$ with $0 < \alpha_i < 1$, let $\Pi(\alpha)$ be the set of policies

$$\Pi(\alpha) = \{\pi : P_i(\delta \ne i) \le \alpha_i, \ \forall i\}.$$

These are policies that meet a specified tolerance for the conditional probability of false detection. We define $\|\alpha\| := \max_i \alpha_i$.
We define $\lambda_i$ to be the best mixed action that guards $H_i$ against its nearest alternative, i.e., $\lambda_i \in \mathcal{P}(\mathcal{A})$ such that

$$\lambda_i := \arg\max_{\lambda \in \mathcal{P}(\mathcal{A})} \ \min_{j \ne i} \ \sum_{a \in \mathcal{A}} \lambda(a) D(q_i^a \| q_j^a). \tag{5}$$

If there are several maximizers, pick one arbitrarily. Further, define

$$D_i := \max_{\lambda \in \mathcal{P}(\mathcal{A})} \ \min_{j \ne i} \ \sum_{a \in \mathcal{A}} \lambda(a) D(q_i^a \| q_j^a). \tag{6}$$


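Let $\mathcal{A}_{ij} := \{a \in \mathcal{A} : D(q_i^a \| q_j^a) > 0\}$, the set of all actions that can differentiate hypothesis $H_i$ from hypothesis $H_j$. Since
$D(q_i^a \| q_j^a) = 0 \Leftrightarrow D(q_j^a \| q_i^a) = 0$, we have $\mathcal{A}_{ij} = \mathcal{A}_{ji}$.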
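The max-min in (5) and (6) is a small linear program. As a rough sketch only, the code below approximates $\lambda_i$ and $D_i$ by brute-force search over a grid on the probability simplex, which is adequate for a handful of actions. The array layout `D[i][j][a]` for $D(q_i^a \| q_j^a)$, the function names, and the numerical values are all assumptions for illustration; an exact solution would use a linear programming solver instead.

```python
from itertools import product

def guard_value(lam, D, i):
    """Objective of (5)-(6): min over j != i of sum_a lam[a] * D[i][j][a]."""
    M = len(D)
    return min(sum(p * d for p, d in zip(lam, D[i][j]))
               for j in range(M) if j != i)

def simplex_grid(K, steps):
    """All probability vectors on K actions whose entries are multiples of 1/steps."""
    for c in product(range(steps + 1), repeat=K - 1):
        if sum(c) <= steps:
            yield [x / steps for x in c] + [(steps - sum(c)) / steps]

def best_mixed_action(D, i, steps=100):
    """Grid approximation of (D_i, lambda_i); ties are broken by grid order."""
    K = len(D[i][i])
    return max(((guard_value(lam, D, i), lam) for lam in simplex_grid(K, steps)),
               key=lambda pair: pair[0])

# Hypothetical relative entropies for M = 3 hypotheses and K = 2 actions.
D = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 1.0], [0.0, 0.0]],
]
D0, lam0 = best_mixed_action(D, 0)   # action 0 separates H_0 from H_1, action 1 from H_2
```

In this toy instance no single action separates $H_0$ from both alternatives, so the maximiser must randomise between the two actions.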

2) Assumptions: Throughout, we make the following assumptions. 


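(I) $E_i\left[\left(\log \frac{q_i^a(X)}{q_j^a(X)}\right)^2\right] < \infty \quad \forall\, i, j, a$.

(IIa) $\mathcal{A}_{ij} \ne \emptyset \ \ \forall\, i, j$ such that $i \ne j$, and

(IIb) $\beta := \min\left\{ \sum_{a \in \mathcal{A}_{ij}} \lambda_k(a) \ \middle|\ 1 \le i, j, k \le M,\ i \ne j \right\} > 0$.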

Assumption (I) implies that $D(q_i^a \| q_j^a) < \infty$, which in turn ensures that no single observation can result in a reliable decision.
Assumption (I) is used in proving the lower bound on the expected number of samples needed to satisfy the tolerance criterion.
This is also assumed by Chernoff [9] and Nitinawarat et al. [13].

Assumption (IIa) ensures that for any distinct $i$ and $j$, there is at least one control that can help distinguish the hypotheses
$H_i$ from $H_j$. If $\mathcal{A}_{ij} = \emptyset$ for some $i$ and $j$, it is impossible to distinguish them from each other. Assumption (IIb) is a stronger
assumption than, and implies, Assumption (IIa). Assumption (IIb) ensures that if actions are taken according to any of the $\lambda_k$
in (5) then, for any pair of hypotheses $H_i$ and $H_j$, there is a positive probability of choosing an action that can discriminate the
pair. We shall use Assumption (IIb) in the achievability proofs of our policies. It allows for easier proofs for our policies, and
makes the presentation simpler. However one can work with Assumption (IIa) as well, and construct asymptotically optimal
policies, with minor modifications to our policies. We will describe the needed modifications later in this section.

3) Switching cost and total cost: The costs are as follows. 

Switching Cost: Let $g(a, a')$ denote the cost of switching from action $a$ to action $a'$. Throughout, we make the following
additional assumptions.

(III) $g(a, a') \ge 0 \ \forall a, a' \in \mathcal{A}$, $g(a, a) = 0 \ \forall a \in \mathcal{A}$, and $g_{\max} := \max_{a, a'} g(a, a') < \infty$.

The assumption in the middle says that not switching incurs zero cost. This assumption will play a crucial role towards our eventual
conclusion that switching costs do not matter in the asymptotics considered in this paper.

Total cost: For a policy $\pi \in \Pi(\alpha)$, the total cost $C(\pi)$ is taken to be the sum of the stopping time (delay) and the net
switching cost, i.e.,

$$C(\pi) := \tau(\pi) + \sum_{l=1}^{\tau(\pi) - 1} g(A_l, A_{l+1}).$$
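For a single realised run, the total cost above is simple to evaluate. The sketch below assumes that `actions` lists $A_1, \ldots, A_\tau$, the actions taken up to the stopping time, and uses a hypothetical constant switching cost that satisfies Assumption (III); both the function names and the cost value are illustrative assumptions.

```python
def total_cost(actions, g):
    """C = tau + sum_{l=1}^{tau-1} g(A_l, A_{l+1}) with tau = len(actions)."""
    tau = len(actions)
    switching = sum(g(a, b) for a, b in zip(actions, actions[1:]))
    return tau + switching

def constant_switch_cost(c):
    """Hypothetical cost: c per change of action, zero when the action is kept."""
    return lambda a, b: 0.0 if a == b else c

# Five time slots with two switches (0 -> 1 and 1 -> 2), each costing 0.5 slots.
cost = total_cost([0, 0, 1, 1, 2], constant_switch_cost(0.5))
```

Since $g \ge 0$, the total cost is never smaller than the stopping time, which is the observation used in Corollary 2.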

4) Asymptotics: We shall be interested in the asymptotics of the minimum expected total cost $E_i[C(\pi)]$, minimized over
policies in $\Pi(\alpha)$, as $\|\alpha\| \to 0$. Note that there are $M$ such conditional expected total costs, one for each hypothesis.


B. Results on the ASHT Model 


We collect all the main results in this section. We first identify a lower bound.

1) The converse - Lower bound: The following proposition gives a lower bound on the conditional expectation of the
stopping time, given hypothesis $H = H_i$, for all policies belonging to $\Pi(\alpha)$.

Proposition 1: Assume (I). For each $i$, we have

$$\lim_{\|\alpha\| \to 0} \ \inf_{\pi \in \Pi(\alpha)} \frac{E_i[\tau(\pi)]}{|\log \|\alpha\||} \ge \frac{1}{D_i}, \tag{7}$$

where $D_i$ is given in (6).
This suffices because the probability of error is dominated by the nearest alternative hypothesis. 
Proof: Since only the expected time to stop is considered, the proof of [9, Th. 2, p. 766] applies.


We then have the following corollary.

Corollary 2: Assume (I). For each $i$, we have

$$\lim_{\|\alpha\| \to 0} \ \inf_{\pi \in \Pi(\alpha)} \frac{E_i[C(\pi)]}{|\log \|\alpha\||} \ge \frac{1}{D_i}. \tag{8}$$

Proof: With switching costs added, we have $C(\pi) \ge \tau(\pi)$, and the corollary follows from Proposition 1.


2) Achievability - A modification to Chernoff’s Procedure A: Chernoff [9] proposed a policy termed Procedure A and
showed that it has asymptotically optimal expected decision delay. We now describe Procedure A.

Policy Procedure A: $\pi_{PA}(L)$

Fix $L > 0$.

At time $n$:

• Let $\theta(n) = \arg\max_i Z_i(n)$, the index with the largest log-likelihood at the current time. Ties are resolved
uniformly at random.

• If $Z_{\theta(n), j}(n) < \log((M-1)L)$ for some $j \ne \theta(n)$ then $A_{n+1}$ is chosen according to $\lambda_{\theta(n)}$, i.e.,

$$\Pr(A_{n+1} = a) = \lambda_{\theta(n)}(a). \tag{9}$$

• If $Z_{\theta(n), j}(n) \ge \log((M-1)L)$ for all $j \ne \theta(n)$ then the test retires and declares $H_{\theta(n)}$ as the true hypothesis.


We now describe a modified policy that comes arbitrarily close to being asymptotically optimal in the presence of switching
costs. We introduce a switching parameter $\eta$, $0 < \eta \le 1$, which determines the maximum transition rate out of a given action.
When $\eta = 1$, we will have the original Procedure A. When $\eta$ approaches zero, the rate of jumping out of the current action
approaches zero.

Policy Sluggish Procedure A: $\pi_{SA}(L, \eta)$

Fix $L > 0$, $0 < \eta \le 1$.

At time $n$:

• Let $\theta(n) = \arg\max_i Z_i(n)$. Ties are resolved uniformly at random.

• If $Z_{\theta(n), j}(n) < \log((M-1)L)$ for some $j \ne \theta(n)$ then $A_{n+1}$ is chosen as follows.

- Generate $U_{n+1}$, a Bernoulli($\eta$) random variable, independent of all other random variables.

- If $U_{n+1} = 0$, then $A_{n+1} = A_n$.

- If $U_{n+1} = 1$, then generate $A_{n+1}$ according to distribution $\lambda_{\theta(n)}$.

• If $Z_{\theta(n), j}(n) \ge \log((M-1)L)$ for all $j \ne \theta(n)$, then the test retires and declares $H_{\theta(n)}$ the true hypothesis.
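The steps of Sluggish Procedure A can be sketched as a short simulation. This is a toy illustration under stated assumptions, not the setting analysed later: observations are Bernoulli with success probability `q[i][a]` under hypothesis $i$ and action $a$, the first action $A_1$ is drawn uniformly (the policy description does not specify it), and the mixed actions `lam[i]` are supplied by the caller. In the usage lines they are hand-picked rather than the maximisers of (5). Only the observation likelihoods are accumulated, since the action-selection probabilities are common to all hypotheses and cancel in every log-likelihood ratio.

```python
import math
import random

def sluggish_procedure_a(q, lam, truth, L, eta, rng, max_steps=100_000):
    """Simulate one run; returns (decision, tau, switches, Z).

    q[i][a] : P(X = 1) under hypothesis i and action a, strictly inside (0, 1)
    lam[i]  : mixed action (probabilities over actions) used when hypothesis i leads
    """
    M, K = len(q), len(q[0])
    threshold = math.log((M - 1) * L)
    Z = [0.0] * M                      # observation log-likelihoods Z_i(n)
    action = rng.randrange(K)          # assumption: A_1 is uniform
    switches = 0
    for n in range(1, max_steps + 1):
        x = 1 if rng.random() < q[truth][action] else 0
        for i in range(M):
            p = q[i][action]
            Z[i] += math.log(p if x == 1 else 1.0 - p)
        best = max(Z)
        theta = rng.choice([i for i in range(M) if Z[i] == best])
        if all(Z[theta] - Z[j] >= threshold for j in range(M) if j != theta):
            return theta, n, switches, Z       # retire and declare H_theta
        if rng.random() < eta:                 # U_{n+1} = 1: resample the action
            new_action = rng.choices(range(K), weights=lam[theta])[0]
        else:                                  # U_{n+1} = 0: keep the action
            new_action = action
        switches += new_action != action
        action = new_action
    raise RuntimeError("no decision within max_steps")

# Hypothetical oddball-style example: 3 locations, action a looks at location a.
q = [[0.8 if a == i else 0.3 for a in range(3)] for i in range(3)]
lam = [[0.8 if a == i else 0.1 for a in range(3)] for i in range(3)]
decision, tau, switches, Z = sluggish_procedure_a(
    q, lam, truth=1, L=100.0, eta=0.2, rng=random.Random(0))
```

Lowering `eta` makes the run keep its current action longer, which reduces the number of switches at the price of a possibly longer stopping time for finite `L`.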


We also consider two variants of $\pi_{SA}(L, \eta)$ which are useful in the analysis.

• Policy $\pi^{i}_{SA}(L, \eta)$: This is the same as $\pi_{SA}(L, \eta)$, but stops only at decision $i$ when $Z_{ij}(n) \ge \log(L(M-1))$ for all $j \ne i$.

• Policy $\tilde{\pi}_{SA}(\eta)$: This is the same as $\pi_{SA}(L, \eta)$, but never stops, and hence $L$ is irrelevant.

Under a fixed hypothesis $H = H_i$, and the triplet of policies $(\pi_{SA}(L, \eta), \pi^{i}_{SA}(L, \eta), \tilde{\pi}_{SA}(\eta))$, it is easily seen that there is a
common underlying probability measure with respect to which the processes $(X_n, A_n)_{n \ge 1}$ associated with the three policies
are naturally coupled, with only the stopping times being different. Under this coupling, the following are true:

$$\tau(\pi^{i}_{SA}(L, \eta)) \ge \tau(\pi_{SA}(L, \eta)),$$

$$\{\tau(\pi_{SA}(L, \eta)) > n\} \subset \{\tau(\pi^{i}_{SA}(L, \eta)) > n\} \subset \left\{ \min_{j \ne i} Z_{ij}(n) < \log(L(M-1)) \right\}.$$

Policy $\pi_{SA}(L, \eta)$ is designed to stop only when the posteriors suggest a reliable decision. This is formalized now.
Proposition 3: Assume (I) and (IIb). For Policy $\pi_{SA}(L, \eta)$, the conditional probability of error under hypothesis $H_i$ is upper
bounded by $P_i(\delta \ne i) \le 1/L$.

See Appendix A-A for a proof. As a consequence we have $\pi_{SA}(L, \eta) \in \Pi(\alpha)$ if $\alpha_i \ge 1/L$ for every $i$.


We now state the time-delay performance of the policy $\pi_{SA}(L, \eta)$.

Theorem 4: Assume (I) and (IIb). Consider the policy $\pi_{SA}(L, \eta)$. The conditional expected time to make a decision, for
each $i$, satisfies

$$\lim_{L \to \infty} \frac{E_i[\tau(\pi_{SA}(L, \eta))]}{\log L} \le \frac{1}{D_i}. \tag{10}$$

See Appendix A-B for a detailed proof. This result will be crucial because the policy $\pi_{SA}(L, \eta)$, despite its sluggishness
induced by $\eta$, remains asymptotically optimal when only the stopping time $\tau(\pi_{SA}(L, \eta))$ is considered as cost. We now leverage
this to show that, if $\eta$ is sufficiently small, $\pi_{SA}(L, \eta)$ is near optimal when switching costs are also taken into account.
Proposition 5: Assume (I), (IIb), and (III). Consider the policy $\pi_{SA}(L, \eta)$. We then have, for each $i$,

$$\lim_{L \to \infty} E_i\left[\frac{C(\pi_{SA}(L, \eta))}{\log L}\right] \le \frac{1 + g_{\max}\eta}{D_i}. \tag{11}$$

Proof: Write $\tau$ for $\tau(\pi_{SA}(L, \eta))$. We can write the following chain of inequalities.

$$\begin{aligned}
E_i[C(\pi_{SA}(L, \eta))] &= E_i\left[\tau + \sum_{l=1}^{\tau - 1} g(A_l, A_{l+1})\right] \\
&\le E_i[\tau] + g_{\max} E_i\left[\sum_{l=1}^{\tau - 1} 1\{A_l \ne A_{l+1}\}\right] \\
&\le E_i[\tau] + g_{\max} E_i\left[\sum_{l=1}^{\tau - 1} U_{l+1}\right] \\
&= E_i[\tau] + g_{\max}\, \eta\, E_i[\tau - 1] \\
&\le E_i[\tau]\,(1 + g_{\max}\eta).
\end{aligned} \tag{12}$$

In the above chain, the inequality in the second line follows from Assumption (III). The equality in the penultimate line holds because of Wald’s
equation [16]. Dividing by $\log L$, letting $L \to \infty$, and using Theorem 4, we see that (11) holds. ■

3) Asymptotic optimality: Corollary 2 and Proposition 5 show that, when the conditional probability of false detection is
driven to zero, the proposed policy $\pi_{SA}(L, \eta)$ has nearly the same growth rate for cost as an asymptotically optimal policy
without switching costs. We now make the above statement precise. The parameter $\eta$ should be suitably chosen to get sufficiently
close to asymptotic optimality.

Theorem 6: Assume (I), (IIb), and (III). Consider a sequence of vectors $(\alpha^{(n)})_{n \ge 1}$, where $\alpha^{(n)}$ is the $n$th tolerance vector,
such that $\lim_{n \to \infty} \|\alpha^{(n)}\| = 0$ and

$$\lim_{n \to \infty} \frac{\|\alpha^{(n)}\|}{\min_k \alpha_k^{(n)}} \le B \tag{13}$$

for some $B$. Then, for each $n$, the policy $\pi_{SA}(L_n, \eta)$ with $\log L_n = -\log \min_k \alpha_k^{(n)}$ belongs to $\Pi(\alpha^{(n)})$. Furthermore, for
each $i$,

$$\lim_{n \uparrow \infty} \ \inf_{\pi \in \Pi(\alpha^{(n)})} \frac{E_i[C(\pi)]}{\log L_n} = \lim_{\eta \downarrow 0} \ \lim_{n \uparrow \infty} \frac{E_i[C(\pi_{SA}(L_n, \eta))]}{\log L_n} = \frac{1}{D_i}. \tag{14}$$

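Proof: The fact that $\pi_{SA}(L_n, \eta) \in \Pi(\alpha^{(n)})$ is evident from Proposition 3 and $1/L_n \le \alpha_k^{(n)}$, $k = 1, 2, \ldots, M$. We then
have the following chain of inequalities:

$$\begin{aligned}
\frac{1}{D_i} &\le \lim_{n \uparrow \infty} \ \inf_{\pi \in \Pi(\alpha^{(n)})} \frac{E_i[C(\pi)]}{|\log \|\alpha^{(n)}\||} \\
&= \lim_{n \uparrow \infty} \ \inf_{\pi \in \Pi(\alpha^{(n)})} \frac{E_i[C(\pi)]}{\log L_n} \\
&\le \lim_{\eta \downarrow 0} \ \lim_{n \uparrow \infty} \frac{E_i[C(\pi_{SA}(L_n, \eta))]}{\log L_n} \\
&\le \frac{1}{D_i}.
\end{aligned}$$

The first inequality follows from Corollary 2. The next equality follows from the fact that

$$\lim_{n \to \infty} \frac{|\log \|\alpha^{(n)}\||}{\log L_n} = 1,$$

which in turn is true due to the assumption (13). The third inequality follows because $\pi_{SA}(L_n, \eta)$ is one specific policy in
$\Pi(\alpha^{(n)})$. The last inequality follows from Proposition 5 after letting $\eta \downarrow 0$. Consequently, all inequalities must be equalities. ■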





















C. Discussion on Assumption (IIb)

Chernoff’s result on the asymptotic optimality of Procedure A [9] was proved under a stronger assumption than Assumption
(IIb), namely, Chernoff required

$$D(q_i^a \| q_j^a) > 0 \text{ for all } a \text{ and for all pairs } i \ne j. \tag{15}$$

Assumption (IIb) ensures that, at all times, and for any pair of hypotheses $i$ and $j$, $i \ne j$, there is a positive probability of
choosing an action that can distinguish the two hypotheses. This suffices for Chernoff’s proofs to go through. Specifically,
we shall use Assumption (IIb) to prove the exponential decay result in Proposition 13 of Appendix A. Nitinawarat et al.
[13] proposed a modified Procedure A that sampled actions randomly at intervals v > 1, and showed that their
proposed policy is asymptotically optimal under the weaker Assumption (IIa). The random sampling enabled them to obtain
a polynomial decay counterpart of Proposition 13 of Appendix A. Recently, Cohen and Zhao [15] showed the asymptotic
optimality of Procedure A under the weaker Assumption (IIa) for an active anomaly detection problem, which is a specific
ASHT problem. We conjecture that Chernoff’s Procedure A is asymptotically optimal under the weaker Assumption (IIa)
for all ASHT problems. A proof of this claim has remained elusive. Nevertheless, policies whose performances are provably
arbitrarily close to the optimum can be designed. We make the above claim precise in the next proposition.

Proposition 7: Assume (I) and (IIa). Fix $\epsilon > 0$. Then there exists a sequence of policies $\{\pi_\epsilon(L)\}$ that satisfies $\pi_\epsilon(L) \in$


$$\lim_{L \to \infty} E_i\left[\frac{\tau(\pi_\epsilon(L))}{\log L}\right] \le \frac{1}{(1 - \epsilon) D_i}. \tag{16}$$


We omit the proof because the needed modifications to the proof of Theorem are straightforward. Policy {7re(L)} can be 
constructed as a variant of Procedure A that, at each instant n, chooses an action according to unif(Zl) with probability e or 
according to (j^ with probability (1 — e). Note that, at each time n, the modified policy {7re(L)} uses a randomisation on 
the actions of the form \e(n) = (1 ~ e)^e(n) + eUnif(Zl). It can be shown that, under hypothesis Hi, 9{n) = i in finite time 
with probability 1, and thereby the asymptotic log likehood ratio rate between Hi and any other Hj will be lower bounded 
by {1 — e)Di. Thus, at the cost of a small penalty, we can design nearly asymptotically optimal policies under the weaker 
Assumption (Ila). A similar argument holds true with switching costs, just as Theoremis extended in Theorem]^ albeit with 
a corresponding but arbitrarily small increase in the total cost. Again, we omit the proof of this claim with switching costs. 
The conclusion is that Assumption (IIa) suffices for the asymptotic growth rate to be arbitrarily close to the optimum.


III. Back to Visual Search

We now return to the visual search problem. In the visual search task, a subject has to identify an oddball image from
amongst W images displayed on a screen (W = 6 in the figures showing the search tasks). For the purpose of modeling, we make the following
assumptions. The subject can focus attention on only one of the W positions, and the field of view is restricted to the image at
that position alone. Further, we assume that time is slotted and each slot is of duration T. The subject can change the focus of
his attention to any of the W image locations, but only at the slot boundaries. A switch in focus of attention (saccade) requires
an integer number of slots for the operation, and no sensing is possible during such a saccade. The time lost during saccades
is modeled as switching costs (delays), and hence the total decision time is the sum of sensing delay and switching delays.
We assume that the subject would have indeed found the exact location and identity of the oddball image before mapping it
to a “left” or “right” decision. These are clearly oversimplifying assumptions, but they enable easier analysis and provide valuable
insights.

If the image in the focused location is $I_k$, we assume that a set of $d$ neurons react accordingly to produce spike trains. These
constitute the observations. Specifically, these are modeled as $d$ independent Poisson point processes of duration $T$ with rates
given by the components of the rate vector $R_k = (R_k(1), R_k(2), \ldots, R_k(d))$. More formally, let $\mathcal{X}$ be the space of counting
processes in $[0,T]$ with an associated $\sigma$-algebra. Let $\mu_{1,T}$ be the standard rate 1 Poisson point process and let $\mu_{1,T}^{\otimes d}$ be its
$d$-fold product measure. Let $\mu_{R_k,T}$ denote the probability measure $P_k$, so that the density of $\mu_{R_k,T}$ with respect to $\mu_{1,T}^{\otimes d}$ is given by

\[ f_k = \frac{d\mu_{R_k,T}}{d\mu_{1,T}^{\otimes d}}, \]

with a similar definition for $f_l$ corresponding to image $I_l$.

We now describe two possible settings. Case 2 will turn out to be closer to the experiment of Sripati and Olson [1].

Case 1: The subject has knowledge that the oddball image is $I_k$ and that the distractors are $I_l$. Since there are W locations,
and the oddball location $i$ satisfies $1 \le i \le W$, there are W hypotheses.


Footnote (on the slotted-time assumption): One could also consider an extension to the continuous-time setting. But all essential ideas are best described in the slotted setting.

The visual search problem under Case 1 can be formulated as an ASHT problem as follows.

• Hypotheses: $H_i$ is the hypothesis that the oddball image ($I_k$) is at location $i$, $1 \le i \le W$.

• Actions: The subject may focus on any one of the W locations, and so $\mathcal{A} = \{1, 2, \ldots, W\}$.

• Observations: The conditional probability density function $q_i^a$ of the observations, under hypothesis $H_i$ and when action
$a$ is chosen, is:

\[ q_i^a = \begin{cases} f_k & \text{if } a = i \\ f_l & \text{if } a \ne i. \end{cases} \]

In words, under Hypothesis $H_i$, the oddball image is $I_k$ and is at location $i$. If the action is to focus on location $i$, i.e.,
$a = i$, then the subject views the oddball image $I_k$, and so the observations have density $f_k$. If $a \ne i$, then the subject views
the distractor image $I_l$, and so the observations have density $f_l$.

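The relative entropies for the various combinations of hypothesis pairs $(i,j)$, with $i \ne j$, and actions are as follows:

\[ D(q_i^a \| q_j^a) = \begin{cases} D(f_k \| f_l) & a = i \\ D(f_l \| f_k) & a = j \\ 0 & a \ne i,\ a \ne j. \end{cases} \]

Proposition 8: For the setting of Case 1, the optimal $\lambda_i$ and the corresponding $D_i$ are as follows.
If $D(f_k \| f_l) > D(f_l \| f_k)/(W-1)$ then

\[ \lambda_i(i) = 1,\quad \lambda_i(j) = 0\ \ \forall j \ne i,\quad \text{and}\quad D_i = D(f_k \| f_l). \tag{17} \]

If $D(f_k \| f_l) \le D(f_l \| f_k)/(W-1)$ then

\[ \lambda_i(i) = 0,\quad \lambda_i(j) = \frac{1}{W-1}\ \ \forall j \ne i,\quad \text{and}\quad D_i = \frac{D(f_l \| f_k)}{W-1}. \]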

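Proof: We can upper bound $D_i$ as follows:

\begin{align}
D_i &= \max_{\lambda \in \mathcal{P}(\mathcal{A})} \min_{j \ne i} \sum_{a} \lambda(a) D(q_i^a \| q_j^a) \tag{18} \\
&= \max_{\lambda \in \mathcal{P}(\mathcal{A})} \min_{j \ne i} \left[ \lambda(i) D(f_k \| f_l) + \lambda(j) D(f_l \| f_k) \right] \tag{19} \\
&= \max_{\lambda \in \mathcal{P}(\mathcal{A})} \left[ \lambda(i) D(f_k \| f_l) + \min_{j \ne i} \lambda(j) D(f_l \| f_k) \right] \tag{20} \\
&\le \max_{\lambda \in \mathcal{P}(\mathcal{A})} \left[ \lambda(i) D(f_k \| f_l) + \frac{1 - \lambda(i)}{W-1} D(f_l \| f_k) \right] \tag{21} \\
&= \begin{cases} D(f_k \| f_l) & \text{if } D(f_k \| f_l) > \frac{D(f_l \| f_k)}{W-1}, \text{ by setting } \lambda(i) = 1, \\ \frac{D(f_l \| f_k)}{W-1} & \text{if } D(f_k \| f_l) \le \frac{D(f_l \| f_k)}{W-1}, \text{ by setting } \lambda(i) = 0. \end{cases} \tag{22}
\end{align}

Here (19) follows from (17), (20) follows after taking the minimisation inside, (21) follows because the minimum of a set
of numbers is upper bounded by their arithmetic mean, and (22) follows by maximising the linear objective function in (21).
Finally, (21) can be made an equality by choosing all $\lambda(j)$, $j \ne i$, to be identical. This proves the Proposition. ■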

Thus, under $H_i$, to distinguish $H_i$ from its nearest alternative, one either focuses only at the oddball location or at any of
the other locations with equal probability, depending on whether $D(f_k \| f_l) > D(f_l \| f_k)/(W-1)$ or not.
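The two-regime rule of Proposition 8 can be evaluated directly. The following Python sketch is illustrative only: the function name `case1_allocation` and its argument names are assumptions introduced here, not notation from the paper. It takes the two relative entropies and the number of locations, and returns the sampling probability on the oddball location, the probability on each other location, and the resulting $D_i$.

```python
def case1_allocation(d_kl, d_lk, num_locations):
    """Sketch of the Case 1 allocation (Proposition 8).

    d_kl: relative entropy D(f_k || f_l), oddball versus distractor.
    d_lk: relative entropy D(f_l || f_k).
    num_locations: W, the number of image locations.
    Returns (lambda on oddball location, lambda on each other location, D_i).
    """
    w = num_locations
    if w < 2:
        raise ValueError("need at least two locations")
    if d_kl > d_lk / (w - 1):
        # Focus only on the hypothesised oddball location.
        return 1.0, 0.0, d_kl
    # Otherwise spread attention uniformly over the other W - 1 locations.
    return 0.0, 1.0 / (w - 1), d_lk / (w - 1)
```

For example, with W = 6 the rule switches to the uniform regime only when $D(f_l \| f_k)$ is at least five times $D(f_k \| f_l)$.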


Case 2: The subject has knowledge of the two competing images $I_k$ and $I_l$, but does not know which of the two is the
oddball image.

This visual search problem can be formulated as a 2W hypothesis testing problem as follows.

• Hypotheses:

$H_i$ with $i \le W$: The oddball image is $I_k$ and is at location $i$. All other locations have image $I_l$.
$H_i$ with $i > W$: The oddball image is $I_l$ and is at location $i - W$. All other locations have image $I_k$.

• Actions: The subject can focus on any one of the W locations, and so $\mathcal{A} = \{1, 2, \ldots, W\}$.

• Observations: The conditional probability density function $q_i^a$ of the observations, under hypothesis $H_i$ and when action
$a$ is chosen, is:

\[ q_i^a = \begin{cases} f_k & i \le W,\ a = i \\ f_l & i \le W,\ a \ne i \\ f_l & i > W,\ a = i - W \\ f_k & i > W,\ a \ne i - W. \end{cases} \]

In words, under Hypothesis $H_i$ with $i \le W$, the oddball image is $I_k$ and is at location $i$. If the action is to focus on location
$i$, i.e., $a = i$, then the subject views image $I_k$ and so the observations have density $f_k$ corresponding to $I_k$. The outcomes of
other actions for this hypothesis are explained similarly. An analogous description holds for outcomes of actions under $H_i$
when $i > W$.

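The relative entropies for the various combinations of hypothesis pairs $(i, j)$ and actions are as follows. The expressions
are self-explanatory.

(i) $i \le W$, $j \le W$, $j \ne i$: (23)

\[ D(q_i^a \| q_j^a) = \begin{cases} D(f_k \| f_l) & a = i \\ D(f_l \| f_k) & a = j \\ 0 & a \ne i,\ a \ne j. \end{cases} \tag{24} \]

(ii) $i \le W$, $j = i + W$: (25)

\[ D(q_i^a \| q_j^a) = \begin{cases} D(f_k \| f_l) & a = i \\ D(f_l \| f_k) & a \ne i. \end{cases} \tag{26} \]

(iii) $i \le W$, $j > W$, $j \ne i + W$: (27)

\[ D(q_i^a \| q_j^a) = \begin{cases} 0 & a = i \\ 0 & a = j - W \\ D(f_l \| f_k) & a \ne i,\ a \ne j - W. \end{cases} \tag{28} \]

(iv) For $i > W$, the expressions for $j > W$, $j = i - W$, or $j \le W$ but $j \ne i - W$ are similar to (i), (ii), and (iii) above,
respectively, but with $f_k$ and $f_l$ interchanged.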

We now identify the structure of $\lambda_i$ and $D_i$ for $i = 1, 2, \ldots, 2W$.

Proposition 9: Let $W \ge 3$. Let $i \le W$. For the setting of Case 2, the optimum $\lambda_i$ and $D_i$ are
as follows. If $D(f_k \| f_l) > D(f_l \| f_k)/(W-1)$ then

\begin{align}
\lambda_i(i) &= \frac{(W-3) D(f_l \| f_k)}{(W-1) D(f_k \| f_l) + (W-3) D(f_l \| f_k)}, \nonumber \\
\lambda_i(j) &= \frac{D(f_k \| f_l)}{(W-1) D(f_k \| f_l) + (W-3) D(f_l \| f_k)} \quad \forall j \ne i, \nonumber \\
D_i &= \frac{(W-2) D(f_k \| f_l) D(f_l \| f_k)}{(W-1) D(f_k \| f_l) + (W-3) D(f_l \| f_k)}. \tag{29}
\end{align}

If $D(f_k \| f_l) \le D(f_l \| f_k)/(W-1)$ then

\[ \lambda_i(i) = 0,\quad \lambda_i(j) = \frac{1}{W-1}\ \ \forall j \ne i,\quad \text{and}\quad D_i = \frac{D(f_l \| f_k)}{W-1}. \]

For $i > W$, $\lambda_i$ and $D_i$ have the same structure as above, but with $f_l$ and $f_k$ interchanged.
For a proof, see the Appendix.
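As a numerical sanity check on Proposition 9, the sketch below evaluates the closed-form allocation and, separately, the smallest discrimination rate over the three kinds of alternatives (i), (ii) and (iii) listed above. It is a hedged illustration: the function names `case2_allocation` and `case2_min_rate` are hypothetical, and the code assumes the allocation puts equal weight on all locations other than the hypothesised oddball location.

```python
def case2_allocation(d_kl, d_lk, num_locations):
    """Closed-form Case 2 allocation for a hypothesis H_i with i <= W.

    d_kl = D(f_k || f_l), d_lk = D(f_l || f_k), num_locations = W >= 3.
    Returns (lambda_i(i), lambda_i(j) for each j != i, D_i).
    """
    w = num_locations
    if w < 3:
        raise ValueError("the closed form assumes W >= 3")
    if d_kl > d_lk / (w - 1):
        denom = (w - 1) * d_kl + (w - 3) * d_lk
        return (w - 3) * d_lk / denom, d_kl / denom, (w - 2) * d_kl * d_lk / denom
    return 0.0, 1.0 / (w - 1), d_lk / (w - 1)


def case2_min_rate(lam_i, lam_j, d_kl, d_lk):
    """Smallest expected log-likelihood-ratio drift over the alternatives."""
    same_oddball = lam_i * d_kl + lam_j * d_lk          # alternatives of type (i)
    swapped_here = lam_i * d_kl + (1.0 - lam_i) * d_lk  # alternative of type (ii)
    swapped_elsewhere = (1.0 - lam_i - lam_j) * d_lk    # alternatives of type (iii)
    return min(same_oddball, swapped_here, swapped_elsewhere)
```

With W = 6 and both relative entropies equal to 1, the allocation is 3/8 on the hypothesised oddball location and 1/8 on each of the other five, and `case2_min_rate` returns 1/2, matching the closed-form $D_i$.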


IV. Proposal for a neuronal dissimilarity index 

We now apply the results obtained in the previous section to the data from the experiments of Sripati and Olson [1]. The
visual search experiments of Sripati and Olson [1] on human subjects correspond closely with Case 2 of the previous section.
Similar to Case 2, the subjects in the experiments had no prior information on which of the two images $I_k$ and $I_l$ was the
oddball image and which the distractor. But different from Case 2, the subjects in the experiments had to learn about the
images $I_k$ and $I_l$ on-the-go, while in Case 2 we assume that the subject knows that the oddball and distractor images come
from the set $\{I_k, I_l\}$. A more accurate modeling that takes the learning aspect into account is work in progress. Here, we shall
proceed with the Case 2 model.

Recall that $T$ is the slot duration during which the subject focuses attention on a particular image. First, we calculate the
relative entropy $D(f_k \| f_l)$ when $f_k$ and $f_l$ are densities of vector Poisson point processes of duration $T$ with rates $R_k = (R_k(1), R_k(2), \ldots, R_k(d))$ and $R_l = (R_l(1), R_l(2), \ldots, R_l(d))$. Under the assumption that the neurons fire independently
with the specified rates, the relative entropy decomposes into a sum:

\begin{align*}
D(\mu_{R_k,T} \| \mu_{R_l,T}) &= E_{\mu_{R_k,T}}\left[ \log \frac{d\mu_{R_k,T}}{d\mu_{R_l,T}} \right] \\
&= \sum_{m=1}^{d} E_{\mu_{R_k(m),T}}\left[ \log \frac{d\mu_{R_k(m),T}}{d\mu_{R_l(m),T}} \right] \\
&= T \sum_{m=1}^{d} \left[ R_k(m) \log\left( \frac{R_k(m)}{R_l(m)} \right) - R_k(m) + R_l(m) \right],
\end{align*}

where the term within square brackets in the last summation, multiplied by $T$, is the relative entropy of the Poisson point process with rate
$R_k(m)$ taken with respect to another such process with rate $R_l(m)$.
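The sum over neurons can be computed in a few lines. The Python sketch below is a minimal illustration, assuming strictly positive rates and the standard closed form $T\,[a \log(a/b) - a + b]$ for the relative entropy between two homogeneous Poisson point processes of rates $a$ and $b$ on $[0,T]$; the function name `poisson_process_kl` is an assumption of this sketch.

```python
import math


def poisson_process_kl(rates_k, rates_l, duration):
    """Relative entropy D(mu_{R_k,T} || mu_{R_l,T}) for independent neurons.

    rates_k, rates_l: sequences of strictly positive firing rates, one per neuron.
    duration: the slot duration T.
    """
    if len(rates_k) != len(rates_l):
        raise ValueError("rate vectors must have the same length")
    total = 0.0
    for r_k, r_l in zip(rates_k, rates_l):
        if r_k <= 0 or r_l <= 0:
            raise ValueError("this sketch assumes strictly positive rates")
        # Per-neuron, per-unit-time relative entropy between Poisson rates.
        total += r_k * math.log(r_k / r_l) - r_k + r_l
    return duration * total
```

The quantity is zero when the two rate vectors coincide and is not symmetric in its arguments, which is why both $D(\mu_{R_k,T} \| \mu_{R_l,T})$ and $D(\mu_{R_l,T} \| \mu_{R_k,T})$ appear in the index.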

In Case 2, if the number of locations $W = 6$, if $I_k$ is the oddball image, if $I_l$ is the distractor image, and if $D(\mu_{R_k,T} \| \mu_{R_l,T}) >
D(\mu_{R_l,T} \| \mu_{R_k,T})/(W-1)$, which is the case when $D(\mu_{R_l,T} \| \mu_{R_k,T})$ is close to $D(\mu_{R_k,T} \| \mu_{R_l,T})$, then from Proposition 9
we have

\[ D_{kl} = \frac{4\, D(\mu_{R_k,T} \| \mu_{R_l,T})\, D(\mu_{R_l,T} \| \mu_{R_k,T})}{5\, D(\mu_{R_k,T} \| \mu_{R_l,T}) + 3\, D(\mu_{R_l,T} \| \mu_{R_k,T})}. \tag{30} \]

Similarly, if $I_l$ is the oddball image, if $I_k$ is the distractor image, and if $D(\mu_{R_l,T} \| \mu_{R_k,T}) > D(\mu_{R_k,T} \| \mu_{R_l,T})/(W-1)$, we
have

\[ D_{lk} = \frac{4\, D(\mu_{R_l,T} \| \mu_{R_k,T})\, D(\mu_{R_k,T} \| \mu_{R_l,T})}{5\, D(\mu_{R_l,T} \| \mu_{R_k,T}) + 3\, D(\mu_{R_k,T} \| \mu_{R_l,T})}. \tag{31} \]

Let us normalize $D_{kl}$ per unit time and per neuron and denote it $\tilde{D}_{kl}$:

\[ \tilde{D}_{kl} = \frac{1}{dT} D_{kl}. \tag{32} \]
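Putting the pieces together, the normalised index can be sketched as follows. This is an illustrative sketch, not the authors' code: the names `poisson_kl` and `proposed_index` are hypothetical, rates are assumed strictly positive, and the second branch (taken from the other regime of Proposition 9) is included only for completeness.

```python
import math


def poisson_kl(rates_a, rates_b, duration):
    """Relative entropy between two vector Poisson processes on [0, T]."""
    return duration * sum(
        a * math.log(a / b) - a + b for a, b in zip(rates_a, rates_b)
    )


def proposed_index(rates_k, rates_l, duration, num_locations=6):
    """Normalised index: D_kl divided by (number of neurons * slot duration)."""
    d = len(rates_k)
    w = num_locations
    d_kl = poisson_kl(rates_k, rates_l, duration)
    d_lk = poisson_kl(rates_l, rates_k, duration)
    if d_kl > d_lk / (w - 1):
        # Regime used in the text; reduces to 4ab / (5a + 3b) when W = 6.
        d_i = (w - 2) * d_kl * d_lk / ((w - 1) * d_kl + (w - 3) * d_lk)
    else:
        d_i = d_lk / (w - 1)
    return d_i / (d * duration)
```

Because both relative entropies scale linearly with $T$, the normalised index does not depend on the slot duration, and it is generally not symmetric in the oddball and distractor rate vectors.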


The subset of experimental data gathered by Sripati and Olson that we use in our analysis consisted of the following.

1) Neuronal firing rate vectors were obtained from the IT cortex of rhesus macaque monkeys for twenty-four images. The
number of neurons ranged from 78 to 174; the variation was due to experimental constraints. But the sets of neurons tapped
were identical for images that were to be paired in the decision time experiments on human subjects, which we describe next.

2) Decision time statistics for detection of the oddball image were obtained from experiments on human subjects. For oddball
image $I_k$ and distractors $I_l$, data was collected as follows. Six subjects participated and each was shown twelve stimuli. In each
stimulus, the oddball location was picked uniformly at random from the $W = 6$ locations. The decision times were averaged
across subjects and across stimulus instances to get $s(k,l)$. The first argument $k$ stands for the oddball image $I_k$.

Recall from Case 2 that $H_i$, when $i \le W = 6$, is the hypothesis that the oddball image is $I_k$ and the distractor images are
$I_l$. Taking a cue from the theorem on the asymptotic growth rate of the total cost, assuming a sufficiently stringent error tolerance vector of $(\alpha, \alpha, \ldots, \alpha)$ for $\alpha$ sufficiently
small, and assuming nearly optimal decision making, we predict that

\[ E_i[C(\pi)] \approx \frac{\log(1/\alpha)}{D_{kl}}, \]

where $C(\pi)$ models the total decision time, the sum of sensing delay and switching delays. Averaging across $i = 1, \ldots, W$,
i.e., averaging across all those stimuli where $I_k$ is the oddball image and $I_l$ is the distractor, we get

\[ E[C(\pi) \mid I_k \text{ is the oddball and } I_l \text{ is the distractor}] \approx \frac{\log(1/\alpha)}{D_{kl}} = \frac{(1/dT)\log(1/\alpha)}{\tilde{D}_{kl}}, \]

or in other words

\[ s(k,l) \cdot \tilde{D}_{kl} \approx \text{constant}. \]

For $i > W = 6$, one similarly has

\[ s(l,k) \cdot \tilde{D}_{lk} \approx \text{constant}. \]

This naturally leads to the proposal

\[ \mathrm{diff}(R_k, R_l) = \tilde{D}_{kl}. \tag{33} \]

A. Correlation study

The behavioural dissimilarity index for an ordered pair of images $(k,l)$ is the inverse $s(k,l)^{-1}$ of the average decision time
$s(k,l)$, and gives an indication of the speed of discrimination. In Figure 2 we plot the behavioural dissimilarity index $s(k,l)^{-1}$
against the proposed neuronal dissimilarity index $\tilde{D}_{kl}$ and against the $L^1$ dissimilarity index for various ordered pairs $(k,l)$.
We observe a strong correlation of 0.94 for $\tilde{D}_{kl}$, which is the same as the correlation between the behavioural dissimilarity
index and the $L^1$ distance $\|R_k - R_l\|_1$.
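For readers who wish to reproduce this kind of comparison on their own data, the sample correlation between inverse decision times and a neuronal index can be computed as sketched below. The function name `pearson_correlation` is an assumption of this sketch, and any numbers passed to it in examples are placeholders, not the experimental data.

```python
import math


def pearson_correlation(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for constant data")
    return s_xy / math.sqrt(s_xx * s_yy)


def behavioural_index(decision_times):
    """Inverse of each average decision time s(k, l)."""
    return [1.0 / s for s in decision_times]
```

One would call `pearson_correlation(behavioural_index(times), index_values)` with one entry per ordered image pair.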




Fig. 2. The observed behavioural dissimilarity index versus the proposed neuronal dissimilarity index ($\tilde{D}$) and the $L^1$-neuronal index.




Fig. 3. The observed decision time versus the inverse of the proposed neuronal index ($1/\tilde{D}$) and the inverse of the $L^1$-neuronal index.

Now that we have discovered that our proposed neuronal dissimilarity index and the $L^1$ index are both equally well
correlated with the behavioural dissimilarity index, it is natural to ask if there is some basis to choose one over the other. The
point that our proposed dissimilarity index has a “microscopic basis” (grounded in decision theory and based on the ASHT
framework) that explains the “macroscopic observations” (speed of discrimination) is certainly in our favour. But there are
other related dissimilarity indices such as relative entropy (KL) and Chernoff entropy that have similar correlation with the
behavioural dissimilarity index. Table II summarises the correlations (second column) along with their p-values (third column).
It is therefore natural to ask if a finer examination of the experimental data can help us identify the “best” among these neuronal
indices. We shall pursue this in the next subsection and shall propose a method to rank order the indices in terms of their
ability to explain the experimental data.

A more basic question, and one that is motivated by our expectation that $s(k,l) \cdot \mathrm{diff}(R_k, R_l) = $ constant, is whether it is
more appropriate to correlate $s(k,l)$ versus $\mathrm{diff}(R_k, R_l)^{-1}$ as opposed to what is done in Figure 2, which correlates $s(k,l)^{-1}$
versus $\mathrm{diff}(R_k, R_l)$. Table II reports these correlations (fourth column) along with the corresponding p-values (fifth column),
and Figure 3 provides the correlation plot. The new correlations, though high, are lower than those reported in the second
column.

TABLE II

Correlation with Different Neuronal Dissimilarity Indices

Information Measure | Correlation (1/s vs. neuronal index) | p-value | Correlation (s vs. (neuronal index)^{-1}) | p-value
Proposed $\tilde{D}$ | 0.94 | 5.2 × 10^{-12} | 0.89 | 4.3 × 10^{-9}
KL | 0.93 | 8.5 × 10^{-11} | 0.90 | 3.1 × 10^{-9}
Chernoff | 0.94 | 7.8 × 10^{-12} | 0.88 | 2.1 × 10^{-8}
$L^1$ | 0.94 | 6.1 × 10^{-12} | 0.88 | 1.1 × 10^{-8}

We do not have a clear-cut answer on which of the two scatter plots,

\[ \left(s(k,l),\ \mathrm{diff}(R_k, R_l)^{-1}\right) \quad \text{or} \quad \left(s(k,l)^{-1},\ \mathrm{diff}(R_k, R_l)\right), \tag{34} \]

and the corresponding correlations, is more appropriate. However, recall that Pearson's test for rejecting the null hypothesis
that a bivariate normal has independent components is that the correlation statistic $r$ arising from independent and identically
distributed samplings of the bivariate normal has $|r|$ exceeding a threshold. Given that $s(k,l)$ is the arithmetic mean of $n = 72$
experimentally measured decision times, when centred and scaled, $s(k,l)$ is likely to be closer to normal than its inverse. We
therefore believe the correlation of $(s(k,l), \mathrm{diff}(R_k, R_l)^{-1})$, the one that leads to lower correlations, is more appropriate. The
indicated p-values, shown in Figures 2 and 3 and in Tables I and II, are the probabilities that the correlation statistic equals or
exceeds the indicated observed levels when the null hypothesis is true (independent components).


B. Model testing via three “equality of means” tests

Earlier, we posed the question of identifying a suitable diff function that satisfies

\[ s(k,l) \cdot \mathrm{diff}(R_k, R_l) = \text{constant}. \tag{35} \]

In the previous section, we modeled visual search as an ASHT problem and proposed the diff given in (33), denoted $\tilde{D}$.
However, we also saw that the candidates $L^1$, Chernoff entropy, relative entropy, and $\tilde{D}$ all yielded high correlation with the
behavioural dissimilarity index. We now address the question of which of these dissimilarity indices best explains the data.
Our methodology is as follows. Consider a fixed $\mathrm{diff}(R_k, R_l)$ function. Let us test the new null hypothesis:

\[ (H_0):\ E[C(\pi) \mid I_k \text{ is the oddball and } I_l \text{ is the distractor}] \cdot \mathrm{diff}(R_k, R_l) = \text{constant}, \]

where $C(\pi)$ is the decision time for a fixed error tolerance on the ordered image pair $(I_k, I_l)$. The decision time data across
subjects and across multiple stimuli that have $I_k$ as the oddball and $I_l$ as the distractor images constitute one group associated
with the ordered pair $(k,l)$. $H_0$ hypothesises that the diff-scaled mean is constant across groups. Let us identify the diff indices
for which the corresponding null hypothesis is accepted for a desired significance level. If the test passes for $\mathrm{diff}(R_k, R_l) = \tilde{D}_{kl}$,
then there is significant evidence that the data is well-explained by our theory.

To perform this test, we must do the following for each diff candidate.

• Identify a test statistic $T(\mathrm{diff})$ for testing equality of means of the diff-scaled decision times. Note that each diff leads to a
separate hypothesis test.

• Accept or reject the corresponding null hypothesis for a desired level of significance.

Let $T_{k,l}(j)$ be the $j$th sample in the group indexed by $(k,l)$. Let $n$ denote the common number of samples in each group,
and let $g$ be the number of groups. The experimental data of Sripati and Olson had 24 groups and 72 samples per group, so that
$n = 72$ and $g = 24$. The number of samples in each group was identical.

1) Test 1 - Oneway ANOVA: The one-way analysis of variance (ANOVA) statistic [11] is often used to test equality of
means across groups when the samples are Gaussian and when the variances across groups are the same. This test is known
to be robust to departures from the Gaussian assumption. It is also known to be robust to departures from the equality of variances assumption so long as the
number of samples is the same across groups (see p. 243 of the cited reference). As we will soon see, we have neither Gaussianity nor equality of
variances across groups. But since the number of samples is the same across the groups, we may still use the oneway ANOVA
test.

Let $\tilde{T}_{k,l}(j) := T_{k,l}(j) \cdot \mathrm{diff}(R_k, R_l)$. Write the sample means, the mean across groups, and the pooled variance as follows.

\[ \bar{T}_{k,l} = \frac{1}{n} \sum_{j=1}^{n} \tilde{T}_{k,l}(j), \qquad \bar{T} = \frac{1}{g} \sum_{(k,l)} \bar{T}_{k,l}, \qquad S_p^2 = \frac{1}{g(n-1)} \sum_{(k,l)} \sum_{j=1}^{n} \left( \tilde{T}_{k,l}(j) - \bar{T}_{k,l} \right)^2. \]

Note that these depend on the diff index under consideration. The oneway ANOVA test [11, p. 533] is as follows: Reject $H_0$
(associated with the diff under consideration) if

\[ T(\mathrm{diff}) := \frac{\frac{n}{g-1} \sum_{(k,l)} \left( \bar{T}_{k,l} - \bar{T} \right)^2}{S_p^2} > F_{g-1,\, g(n-1),\, \alpha}, \]

where $\alpha$ is the desired significance level and $F_{g-1,\, g(n-1),\, \alpha}$ is the corresponding threshold.
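The statistic can be computed without any statistics library. The sketch below assumes the standard one-way ANOVA F statistic for equal group sizes (between-group mean square divided by the pooled within-group variance); the function name `oneway_anova_statistic` is hypothetical, and the input is assumed to hold the diff-scaled decision times already grouped by ordered image pair.

```python
def oneway_anova_statistic(groups):
    """F statistic for g groups with a common number n of samples each.

    groups: list of lists; groups[m] holds the diff-scaled samples of group m.
    """
    g = len(groups)
    n = len(groups[0])
    if g < 2 or n < 2 or any(len(grp) != n for grp in groups):
        raise ValueError("need >= 2 groups with a common size >= 2")
    group_means = [sum(grp) / n for grp in groups]
    grand_mean = sum(group_means) / g
    # Between-group mean square: n / (g - 1) times the spread of group means.
    between = n * sum((m - grand_mean) ** 2 for m in group_means) / (g - 1)
    # Pooled within-group variance with g * (n - 1) degrees of freedom.
    pooled = sum(
        (x - m) ** 2 for grp, m in zip(groups, group_means) for x in grp
    ) / (g * (n - 1))
    return between / pooled
```

The returned value would then be compared against the upper-$\alpha$ point of the F-distribution with $(g-1, g(n-1))$ degrees of freedom; a smaller statistic means the diff-scaled group means are more tightly clustered.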

Look at the second and third columns of Table III. The first row contains the value of the ANOVA statistic with $\mathrm{diff}(R_k, R_l) =
\tilde{D}_{kl}$ and the corresponding p-value. The p-value is so small that we must summarily reject the null hypothesis $H_0$ associated
with $\mathrm{diff} = \tilde{D}$ (for, say, a typical significance level of 5%). The situation is the same for the other dissimilarity indices, as can
be seen from the remaining rows of Table III. In each test, the null hypothesis is rejected at the typical 5% significance
level.

Observe that the values of $T(\tilde{D})$, $T(\mathrm{KL})$, and $T(\mathrm{Chernoff})$ are close to each other while $T(L^1)$ is significantly larger. This
suggests that one could use $T(\cdot)$ to rank the different dissimilarity measures in their ability to explain the observed data. The
oneway ANOVA statistic suggests the ranking

\[ \tilde{D} > \mathrm{KL} > \mathrm{Chernoff} > L^1. \tag{36} \]

We shall return to this observation after trying out two other refinements of the equality of means test.

2) Equality of Means for Gamma Distributions: We began with the oneway ANOVA statistic because it is known to be
robust to departures from the Gaussian assumption. We checked for Gaussianity anyway. Lilliefors' test for Gaussianity is a variation on the
Kolmogorov-Smirnov test when the null hypothesis does not specify the parameters of the Gaussian distribution. None of the
24 groups of data passed the test of Gaussianity at the 5% significance level.

We next looked for features in the data that may suggest other distributions. First, the decision times are positive random
variables. Next, Figure 4 shows the standard deviation versus the mean decision time for the 24 image pairs. Observe the linear
relation, with $y = 0.61x$ being the best linear fit. A class of distributions on $\mathbb{R}_+$ whose standard deviation is a linear function
of its mean is the family of Gamma distributions with a fixed shape parameter. The Gamma density with shape parameter $s$
and scale $m$ is

\[ f(x) = \frac{x^{s-1} e^{-x/m}}{\Gamma(s)\, m^{s}}, \quad x > 0. \]

The mean is $ms$ and the standard deviation is $m\sqrt{s}$, so that the standard deviation to mean ratio is $1/\sqrt{s}$. The slope of 0.61 in Figure
4 suggests a shape parameter of $1/(0.61)^2 = 2.7$. The Kolmogorov-Smirnov test on the data against Gamma distributions with
shape parameter 2.7 and mean set to the sample mean accepts 18 of the 24 image pairs and rejects 6 out of 24 image pairs
at the 5% significance level. This suggests that Gamma is a better fit to the data than Gaussian.
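The method-of-moments step above amounts to inverting the relation between shape and coefficient of variation. A minimal sketch, with hypothetical function names, is:

```python
import math


def gamma_shape_from_cv(sd_to_mean_ratio):
    """Shape s of a Gamma distribution whose sd/mean ratio is 1/sqrt(s)."""
    if sd_to_mean_ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 / sd_to_mean_ratio ** 2


def gamma_mean_and_sd(shape, scale):
    """Mean m*s and standard deviation m*sqrt(s) for shape s and scale m."""
    return shape * scale, scale * math.sqrt(shape)
```

With the fitted slope of 0.61, `gamma_shape_from_cv(0.61)` is approximately 2.69, which rounds to the value 2.7 used in the text.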

We conducted a generalised likelihood ratio test (GLRT) for equality of means under the Gamma assumption, and under
a constant shape parameter assumption. This corresponds to an “equality of scales” test; Shiue et al. suggest a statistic
analogous to the oneway ANOVA but for Gamma distributions. In Figure 5 we plot the CDF of the GLRT statistic under the
null hypothesis and under equal mean and equal shape parameter assumptions. Note that the CDF of this statistic is robust to
the shape parameter.

In column 4 of Table III we provide the GLRT statistics for the decision time data. If we compare the GLRT statistics from
Table III against the GLRT CDF in Figure 5, we observe that the statistics are well beyond the 5% significance point. Again, we
must summarily reject each of the equality of means hypotheses. Indeed, each GLRT statistic is off the chart in Figure 5.
However, direct ordering of the statistics suggests the ranking (36), the same as that obtained with ANOVA.


Footnote (on the threshold $F_{g-1,\, g(n-1),\, \alpha}$): The threshold at which the cdf of the F-distribution with $(g-1, g(n-1))$ degrees of freedom equals $1 - \alpha$.

Fig. 4. Standard deviation of the decision times versus mean decision times, across image pairs.


3) Equality of Means for Gamma Distributions with a fixed and known shape parameter: Consider the setting where the
shape parameter is known to be $s$ across the groups. The GLRT for equality of means under this setting can be straightforwardly
shown to be $s \log(\mathrm{AM}/\mathrm{GM})$, where AM and GM are the arithmetic and geometric means across groups of the group sample means $\bar{T}_{k,l}$ of the diff-scaled decision times, defined as follows:

\[ \mathrm{AM} = \bar{T} = \frac{1}{g} \sum_{(k,l)} \bar{T}_{k,l} \quad \text{and} \quad \mathrm{GM} = \left( \prod_{(k,l)} \bar{T}_{k,l} \right)^{1/g}. \]

The last column of Table III shows this statistic for the various dissimilarity measures. Yet again, the statistics are off the chart
(CDF not plotted), with p-values so small that the null hypothesis of equality of means must be rejected. Once again, direct
ordering of statistics suggests the ranking (36).
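The AM/GM statistic is straightforward to compute from the group sample means. The sketch below is illustrative: the function name `gamma_known_shape_statistic` is an assumption, and the geometric mean is evaluated through logarithms for numerical stability.

```python
import math


def gamma_known_shape_statistic(group_means, shape):
    """Return s * log(AM / GM) for strictly positive group sample means."""
    g = len(group_means)
    if g == 0 or any(m <= 0 for m in group_means):
        raise ValueError("need a non-empty list of strictly positive means")
    arithmetic_mean = sum(group_means) / g
    # log of the geometric mean is the average of the logs.
    log_geometric_mean = sum(math.log(m) for m in group_means) / g
    return shape * (math.log(arithmetic_mean) - log_geometric_mean)
```

By the AM-GM inequality the statistic is nonnegative, and it is zero exactly when all diff-scaled group means coincide, which is the situation the null hypothesis describes.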

4) A lesser goal - Ranking: We saw that all three equality of means tests reject all four dissimilarity measures. In retrospect,
this might have been anticipated. If the test associated with $\tilde{D}$ had passed, that would have been a spectacular confirmation
of our theory, which we really did not expect due to the crudeness of our modeling. Nevertheless, the equality of means test
statistic provides a means to check which of the four dissimilarity measures best explains the data.

A little thought will inform us that all three equality of means tests check how clustered the sample means are across the
various groups. This is clearest in the third equality of means test (for Gamma distributions with a known and common shape
parameter across groups), where the statistic is a monotone function of the ratio AM/GM. The test passes if AM/GM is small,
that is, if the group means are close to each other, and fails if it is large.

The columns of Table III corresponding to each test statistic suggest that the data points are most clustered under $\tilde{D}$ scaling
and least clustered with $L^1$ scaling. The ranking is as in (36).

Let us note that $\tilde{D}$, KL, and Chernoff are close to each other, and $L^1$ is a distant fourth. Indeed, the vector $(\bar{T}_{k,l}/\bar{T})_{(k,l)}$ associated
with the $L^1$ dissimilarity measure majorised (Defn. A.1 of the cited reference) the other three. There was no such ordering among $\tilde{D}$, KL, and
Chernoff.

TABLE III

Equality of means test. Various Statistics

diff | ANOVA statistic | ANOVA p-value | Gamma GLR | s log(AM/GM)
Proposed $\tilde{D}$ | 6.30 | 9.35 × 10^{-19} | 0.0533 | 0.0600
KL | 6.68 | 2.88 × 10^{-20} | 0.0561 | 0.0633
Chernoff | 6.74 | 1.61 × 10^{-20} | 0.0663 | 0.0756
$L^1$ | 24.00 | 3.42 × 10^{-87} | 0.1652 | 0.2061


V. Discussion 

We modelled the visual search task of Sripati and Olson [1] as an active sequential hypothesis testing (ASHT) problem. We
extended the ASHT results of Chernoff to the case with switching costs. We showed that adding switching costs does not
affect the asymptotic growth rate of the total cost.

The ASHT model suggests a dissimilarity index between pairs of images: the inverse of the asymptotic growth rate of
the total cost in the ASHT model. We derived expressions for
computing the proposed dissimilarity index for the specific search task considered by Sripati and Olson [1]. The proposed
dissimilarity index is a function of the neuronal firing rates elicited by the images in the inferotemporal cortex of macaque
monkeys.

Fig. 5. Superposition of CDFs of the GLRT statistic under the Gamma assumption for the equality of means test. Each CDF consists of 1000 sample points.
Each CDF corresponds to a random instance of mean and shape parameter. The mean was uniformly sampled from [0.2, 1.2]. The shape parameter was uniformly
sampled from [2, 5]. Each sample point consisted of 24 groups and 72 samples per group, the same as for the experimental data for decision times. The
indicated intervals for mean and shape were based on the experimental data.

The correlation study indicated that the proposed index is as good as $L^1$ and other dissimilarity measures such as the Chernoff
entropy and the relative entropy (KL). Equality of means testing indicated that the equality of means hypothesis should be
rejected, and this can be done with overwhelming confidence.

The equality of means testing procedure is perhaps a rather stringent test. What would be an appropriate test if, say, we can
leave one group out? Does our proposed neuronal dissimilarity index pass such a less stringent test? Can we leave two groups
out? Which two? We do not yet have a principled way to address these questions and instead decided to stick to the strictest
test.

The statistics associated with the equality of means testing, however, suggested a ranking of the dissimilarity measures. We
proposed three different statistics. Each measures the spread across groups of the group sample means. One of them is the
familiar AM/GM ratio. The ranking was consistent across the three different statistics. Our proposed index was ranked first,
relative entropy (KL divergence) and Chernoff entropy were a close second and third, and $L^1$ was a somewhat distant fourth.

The decision times were tested for the Gamma distribution and the test passed for two-thirds of the groups. The shape
parameter for the distributions of delay, estimated via the method of moments, was close to 3.

In our work, we took only valid trials, i.e., those where the decisions made by the subjects were correct. We also assumed
that the error probability tolerances were the same across subjects. It would be interesting to model speed-accuracy tradeoffs
and see how they vary across individuals. It would also be interesting to explore how they vary for a single subject under
different incentive settings.

Extension of ASHT to the case when no prior information is available about the images, where the subject has to actively
learn $R_k$ and $R_l$ on-the-fly, is an interesting learning problem that is currently under study.


Appendix A 

Properties of log-likelihood ratio processes under $\pi_{SA}(L, \eta)$

We will now show some desirable properties of the log-likelihood ratio processes under the policy $\pi_{SA}(L, \eta)$. These properties
are analogous to those of classical sequential hypothesis testing, but their analyses are more involved because, with actions,
1) the log-likelihood ratio increments are dependent, and 2) the increments are no longer identically distributed. The properties
we will establish will be useful in forthcoming proofs.

Define $\Delta Z_{ji}(n) = Z_{ji}(n) - Z_{ji}(n-1)$. We then have $\Delta Z_{ji}(n) = -\Delta Z_{ij}(n)$. Here, $\Delta Z_{ji}(n)$ is the increment in the process
associated with the log-likelihood ratio of $H_j$ with respect to $H_i$ at time $n$. We now show that under Assumptions (I) and (IIb),
and under policy $\pi_{SA}(L, \eta)$, the log-likelihood ratio processes are well behaved in the following sense: the log-likelihood ratio
of the true hypothesis $H_i$ with respect to any other hypothesis $H_j$ has a positive drift. This will be made precise in Proposition
13. Towards that, we first establish the following lemmas.

Lemma 10: Assume (I) and (IIb). Fix $i, j$ such that $j \ne i$. Let $a \in \mathcal{A}_{ij}$. We then have, for all $0 < s < 1$,

\[ \rho_{ij}^{a}(s) := E_i\left[ e^{s \Delta Z_{ji}(n)} \mid A_n = a \right] < 1 \quad \forall n. \tag{37} \]

Proof: The following sequence of inequalities hold:

\begin{align}
E_i\left[ e^{s \Delta Z_{ji}(n)} \mid A_n = a \right] &= \int_{x \in \mathcal{X}} \left( \frac{q_j^a(x)}{q_i^a(x)} \right)^{s} q_i^a(x)\, dx \nonumber \\
&< \left( \int_{x \in \mathcal{X}} q_j^a(x)\, dx \right)^{s} \left( \int_{x \in \mathcal{X}} q_i^a(x)\, dx \right)^{1-s} \tag{38} \\
&= 1. \nonumber
\end{align}

The strict inequality in (38) follows from Hölder's inequality and the fact that $a \in \mathcal{A}_{ij}$ implies $q_i^a$ and $q_j^a$ are not linearly
related. ■

The above result was obtained by conditioning on the action to lie in the desirable set $\mathcal{A}_{ij}$. The result is independent
of the underlying policy, because when conditioned on the current action $A_n$, the observation is independent of the policy.
Recall that $\pi_{SA}(\eta)$ is the non-stopping variant of $\pi_{SA}(L, \eta)$. Further, recall from Assumption (IIb) that we have $\beta =
\min\left\{ \sum_{a \in \mathcal{A}_{ij}} \lambda_k(a) \mid 1 \le i, j, k \le M,\ i \ne j \right\} > 0$. Now we show that, under Assumption (IIb) and policy $\pi_{SA}(\eta)$, a similar
result holds, but without conditioning on the action $A_n$. First, let us define

\[ \rho_{ij}(s) := \eta\beta \left( \max_{a \in \mathcal{A}_{ij}} \rho_{ij}^{a}(s) \right) + (1 - \eta\beta). \tag{39} \]

The fact that $\rho_{ij}(s) < 1$ is evident from Lemma 10.


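Lemma 11: Assume (I) and (IIb). Consider the policy $\pi_{SA}(\eta)$. Fix $i$. We then have, for all $0 < s < 1$,

\[ E_i\left[ e^{s \Delta Z_{ji}(n)} \mid X^{n-1}, A^{n-1} \right] \le \rho_{ij}(s) < 1 \quad \forall n,\ \forall j \ne i. \]

Proof: The following sequence of inequalities hold, as described after the last inequality.

\begin{align}
E_i\left[ e^{s \Delta Z_{ji}(n)} \mid X^{n-1}, A^{n-1} \right] &= \sum_{a \in \mathcal{A}} P_i\left( A_n = a \mid X^{n-1}, A^{n-1} \right) E_i\left[ e^{s \Delta Z_{ji}(n)} \mid A_n = a \right] \tag{40} \\
&\le P_i\left( A_n \in \mathcal{A}_{ij} \mid X^{n-1}, A^{n-1} \right) \max_{a \in \mathcal{A}_{ij}} E_i\left[ e^{s \Delta Z_{ji}(n)} \mid A_n = a \right] \nonumber \\
&\qquad + \left( 1 - P_i\left( A_n \in \mathcal{A}_{ij} \mid X^{n-1}, A^{n-1} \right) \right) \tag{41} \\
&\le \eta\beta \left( \max_{a \in \mathcal{A}_{ij}} \rho_{ij}^{a}(s) \right) + (1 - \eta\beta) \nonumber \\
&< 1. \tag{42}
\end{align}

Equality (40) holds because, conditioned on $A_n = a$, $\Delta Z_{ij}(n)$ is independent of the remaining history. Inequality (41) holds
because, when $a \notin \mathcal{A}_{ij}$, we have $\Delta Z_{ij}(n) = 0$. The penultimate inequality is a consequence of the fact that, under $\pi_{SA}(L, \eta)$,
one will choose an action $a \in \mathcal{A}_{ij}$ with probability at least $\eta\beta$. ■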

We now proceed to show an inequality analogous to the Chernoff bound for the log-likelihood ratio. In classical sequential
hypothesis testing, due to independence of samples across time, the expectation of the likelihood ratio can be split as the
product of the expectation of the likelihood ratio increments, as follows:

\[ E_i\left[ e^{s Z_{ji}(n)} \right] = \prod_{k=1}^{n} E_i\left[ e^{s \Delta Z_{ji}(k)} \right]. \]

The same decomposition is not valid in ASHT because actions introduce dependency in the likelihood ratio increments across
time. However, we can obtain an upper bound of the product form.

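Lemma 12: Assume (I) and (IIb). Consider policy $\pi_{SA}(\eta)$. Fix $i$. We then have, for all $0 < s < 1$,

\[ E_i\left[ e^{s Z_{ji}(n)} \right] \le \left( \rho_{ij}(s) \right)^{n} \quad \forall n,\ \forall j \ne i. \]

Proof: Once again, we proceed through a chain of inequalities, all of which are now self-evident:

\begin{align*}
E_i\left[ e^{s Z_{ji}(n)} \right] &= E_i\left[ E_i\left[ e^{s Z_{ji}(n-1)} e^{s \Delta Z_{ji}(n)} \mid X^{n-1}, A^{n-1} \right] \right] \\
&= E_i\left[ e^{s Z_{ji}(n-1)} E_i\left[ e^{s \Delta Z_{ji}(n)} \mid X^{n-1}, A^{n-1} \right] \right] \\
&\le \rho_{ij}(s)\, E_i\left[ e^{s Z_{ji}(n-1)} \right] \quad \text{(from Lemma 11)} \\
&\le \left( \rho_{ij}(s) \right)^{n},
\end{align*}

where the last inequality follows by induction. ■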

We now show an exponential decay property of the log-likelihood process, which primarily stems from the anticipated negative drift in $Z_{ji}(n)$ for $j \ne i$. Let us alert the reader that in the following proposition we deal with $Z_{ij}(n) = -Z_{ji}(n)$.

Proposition 13: Assume (I) and (IIb). Consider policy $\pi_{SA}(\eta)$. Fix $i$. There exist constants $C_K > 0$ and $\gamma > 0$ such that

$$P_i\left( \min_{j \ne i} Z_{ij}(n) < K \right) < C_K e^{-\gamma n}. \tag{43}$$

$C_K$ is independent of $i$, but $\gamma$ may depend on $i$.

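Proof: This follows from the previous lemmas via the following chain:

$$\begin{aligned}
P_i\left( \min_{j \ne i} Z_{ij}(n) < K \right)
&= P_i\left( \max_{j \ne i} Z_{ji}(n) > -K \right) \\
&\le \sum_{j \ne i} P_i\left( Z_{ji}(n) > -K \right) && (44) \\
&\le e^{sK} \sum_{j \ne i} E_i\left[ e^{s Z_{ji}(n)} \right] && (45) \\
&\le e^{sK} (M - 1) \max_{j \ne i} \left( \rho_{ij}(s) \right)^n && (46) \\
&< C_K e^{-\gamma n},
\end{aligned}$$

where $\max_{j \ne i} \rho_{ij}(s) = e^{-\gamma}$ and $C_K = M e^{sK}$. The inequality in (44) is due to the union bound, the inequality in (45) is due to Chernoff's bound with $0 < s < 1$, and the inequality in (46) is due to Lemma 12. ■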


We now show that under the hypothesis $H = H_i$, the $\theta(n)$ process eventually settles at $i$. Indeed we show something stronger. Let us define

$$T_i := \inf\{ n : \theta(n') = i,\ \forall n' \ge n \}, \tag{47}$$

the time from which $\theta(n)$ remains settled at $i$. This random variable has a tail that decays exponentially fast, as shown next.

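Lemma 14: Assume (I) and (IIb). Consider policy $\pi_{SA}(\eta)$. Fix $i$. Then there exist $C > 0$ and $b > 0$, both finite and possibly dependent on $i$, such that

$$P_i\left( T_i > n \right) < C e^{-bn}. \tag{48}$$

Proof: By the union bound,

$$\begin{aligned}
P_i\left( T_i > n \right) &= P_i\left( \theta(n') \ne i \text{ for some } n' \ge n \right) \\
&\le \sum_{n' \ge n} P_i\left( \theta(n') \ne i \right) \\
&\le \sum_{n' \ge n} P_i\left( \min_{j \ne i} Z_{ij}(n') < 0 \right).
\end{aligned}$$

The assertion now follows from Proposition 13. ■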
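The last step can be spelled out as a geometric series. The following is a sketch under the assumption that Proposition 13 is applied with $K = 0$, so that the constant is $C_0$ and the exponent is the $\gamma$ of that proposition:

$$\sum_{n' \ge n} P_i\left( \min_{j \ne i} Z_{ij}(n') < 0 \right) < \sum_{n' \ge n} C_0 e^{-\gamma n'} = \frac{C_0}{1 - e^{-\gamma}} \, e^{-\gamma n},$$

so that (48) holds with, for instance, $C = C_0 / (1 - e^{-\gamma})$ and $b = \gamma$.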

Thus far we have considered the policy $\pi_{SA}(\eta)$, which never stops. We now show that the policy $\pi_{SA}(L,\eta)$ stops in finite time.

Proposition 15: Assume (I) and (IIb). Consider the policy $\pi_{SA}(L,\eta)$. Fix $i$. We then have

$$P_i\left( \tau(\pi_{SA}(L,\eta)) < \infty \right) = 1.$$

Proof: We consider $\pi^i_{SA}(L,\eta)$ for analysis. Recall that $\tau(\pi_{SA}(L,\eta)) \le \tau(\pi^i_{SA}(L,\eta))$, and hence it is sufficient to show that

$$P_i\left( \tau(\pi^i_{SA}(L,\eta)) < \infty \right) = 1. \tag{49}$$

From Proposition 13 we know that, for a suitable constant $C$,

$$P_i\left( \min_{j \ne i} Z_{ij}(n) < \log(L(M-1)) \right) < C e^{-\gamma n}.$$

Since this bound is summable, by the Borel-Cantelli lemma,

$$P_i\left( \min_{j \ne i} Z_{ij}(n) < \log(L(M-1)) \text{ infinitely often} \right) = 0,$$

which is stronger than the assertion (49). ■

Propositions 13 and 15 are the ones that will be used in the sequel.


A. Proof of Proposition 

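The proof relies on a standard change of measure argument. Let $\Delta_j$ denote the event that the policy $\pi_{SA}(L,\eta)$ declares $H_j$ as the true hypothesis. Then

$$\begin{aligned}
P_i(d \ne i) &= \sum_{j \ne i} P_i(d = j) + P_i\left( \tau(\pi_{SA}(L,\eta)) = \infty \right) \\
&= \sum_{j \ne i} \sum_{n > 0} \int_{\omega^n \in \Delta_j} dP_i(\omega^n) + 0 \\
&= \sum_{j \ne i} \sum_{n > 0} \int_{\omega^n \in \Delta_j} \frac{dP_i(\omega^n)}{dP_j(\omega^n)} \, dP_j(\omega^n) \\
&\le \frac{1}{(M-1)L} \sum_{j \ne i} \sum_{n > 0} \int_{\omega^n \in \Delta_j} dP_j(\omega^n) && (50) \\
&\le \frac{1}{L}.
\end{aligned}$$

The second equality holds because we have shown in Proposition 15 that the stopping time is finite with probability 1. The inequality (50) follows because $\omega^n \in \Delta_j$ implies $Z_{ji}(n)(\omega^n) > \log((M-1)L)$, that is, $\frac{dP_i(\omega^n)}{dP_j(\omega^n)} < \frac{1}{(M-1)L}$. ■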


B. Proof of Theorem Achievability 

We assume (I) and (IIb). All statements in this proof are under $H = H_i$ and under Sluggish Procedure A. We follow the proof technique of Chernoff [9, Lem. 2]. Chernoff's proof technique does not go through completely because, unlike in Procedure A, the next action in Sluggish Procedure A is not conditionally independent of the previous action, given the current likelihood values. A similar issue was addressed by Nitinawarat and Veeravalli in [14], in the context of a Markovian observation model, and we will adapt their proof technique to our setting.

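Let us first set up some notation. Fix $\varepsilon > 0$. Define

$$D_{ij} := \sum_{a \in \mathcal{A}} \lambda_i(a) D(q_i^a \| q_j^a),$$

where $\lambda_i$ is as defined earlier. Let $D_i$ be as defined earlier, i.e., $D_i = \min_{j \ne i} D_{ij}$. Under the Sluggish Procedure A, the transition probability matrix $TP(\theta(n))$ of the action process $A_n$ at time $n$ is given by

$$TP(\theta(n)) = (1 - \eta) I + \eta \left( \mathbf{1} \lambda_{\theta(n)}^T \right). \tag{51}$$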
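A quick numerical check of the structure of (51) can be sketched as follows. This is only an illustration: the distribution `lam` and the switching probability `eta` are hypothetical values, and the helper names are not from the paper. The sketch builds a matrix of the form $(1-\eta) I + \eta \mathbf{1} \lambda^T$ and verifies that $\lambda$ is left-invariant under it.

```python
def sluggish_transition_matrix(lam, eta):
    """Build (1 - eta) * I + eta * 1 lam^T, the form of the matrix in (51)."""
    m = len(lam)
    return [[(1.0 - eta) * (1.0 if r == c else 0.0) + eta * lam[c]
             for c in range(m)] for r in range(m)]


def left_multiply(dist, matrix):
    """Row vector times matrix: one step of the action chain's distribution."""
    m = len(dist)
    return [sum(dist[r] * matrix[r][c] for r in range(m)) for c in range(m)]


lam = [0.5, 0.3, 0.2]  # hypothetical sampling distribution over three actions
eta = 0.25             # hypothetical switching probability
tp = sluggish_transition_matrix(lam, eta)
print(left_multiply(lam, tp))  # should reproduce lam up to rounding
```

The invariance follows because $\lambda \left( (1-\eta) I + \eta \mathbf{1} \lambda^T \right) = (1-\eta)\lambda + \eta\lambda = \lambda$ whenever the entries of $\lambda$ sum to one.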

It is easy to verify that the stationary distribution associated with $TP(\theta(n))$ is $\lambda_{\theta(n)}$. Define $\mathcal{F}_{k-1} := \sigma(X^{k-1}, A^{k-1})$, the $\sigma$-field generated by the random variables $(X^{k-1}, A^{k-1})$.

We now upper bound the expected time to make a decision under Sluggish Procedure A as follows:

$$\begin{aligned}
E_i\left[ \tau(\pi_{SA}(L,\eta)) \right] &\le E_i\left[ \tau(\pi^i_{SA}(L,\eta)) \right] \\
&= \sum_{n \ge 0} P_i\left( \tau(\pi^i_{SA}(L,\eta)) > n \right) \\
&\le \frac{(1+\varepsilon) \log(L(M-1))}{D_i} + \sum_{n \ge \tilde{n}} P_i\left( \tau(\pi^i_{SA}(L,\eta)) > n \right), && (52)
\end{aligned}$$

where

$$\tilde{n} := \frac{(1+\varepsilon) \log(L(M-1))}{D_i}.$$

To complete the proof, we will now show that for any $\varepsilon > 0$, the second term on the right-hand side of (52) goes to zero as $L \to \infty$. Indeed, we claim that each term in the summation decays exponentially with $n$, with an exponent that does not depend on $L$. Assuming the claim, the tail sum vanishes as $L \to \infty$, because $\tilde{n} \to \infty$. This suffices to complete the proof of the achievability part of the theorem.

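We now proceed to prove the claim. Observe that

$$\begin{aligned}
P_i\left( \tau(\pi^i_{SA}(L,\eta)) > n \right) &\le P_i\left( \min_{j \ne i} Z_{ij}(n) < \log(L(M-1)) \right) \\
&\le \sum_{j \ne i} P_i\left( Z_{ij}(n) < \log(L(M-1)) \right).
\end{aligned}$$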

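Fix one $j \ne i$. (The same analysis holds for other $j$.) Then

$$\begin{aligned}
P_i\left( Z_{ij}(n) < \log(L(M-1)) \right)
&= P_i\left( \sum_{k=1}^{n} \Delta Z_{ij}(k) < \log(L(M-1)) \right) \\
&= P_i\Bigg( \sum_{k=1}^{n} \left( \Delta Z_{ij}(k) - E_i\left[ \Delta Z_{ij}(k) \mid \mathcal{F}_{k-1} \right] + \varepsilon' \right) \\
&\qquad\quad + \sum_{k=1}^{n} \left( E_i\left[ \Delta Z_{ij}(k) \mid \mathcal{F}_{k-1} \right] - D_{ij} + \varepsilon' \right) \\
&\qquad\quad + n \left( D_{ij} - 2\varepsilon' \right) < \log(L(M-1)) \Bigg) \\
&\le P_i\left( \sum_{k=1}^{n} \left( \Delta Z_{ij}(k) - E_i\left[ \Delta Z_{ij}(k) \mid \mathcal{F}_{k-1} \right] + \varepsilon' \right) < 0 \right) \\
&\quad + P_i\left( \sum_{k=1}^{n} \left( E_i\left[ \Delta Z_{ij}(k) \mid \mathcal{F}_{k-1} \right] - D_{ij} + \varepsilon' \right) < 0 \right) \\
&\quad + P_i\left( n \left( D_{ij} - 2\varepsilon' \right) < \log(L(M-1)) \right). && (53)
\end{aligned}$$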


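Look at the first probability term in (53). Each entry within the summation has a positive mean and, from Chernoff's bounding technique in [9, Lem. 2], there exists a $b(\varepsilon') > 0$ such that

$$P_i\left( \sum_{k=1}^{n} \left( \Delta Z_{ij}(k) - E_i\left[ \Delta Z_{ij}(k) \mid \mathcal{F}_{k-1} \right] + \varepsilon' \right) < 0 \right) < e^{-n b(\varepsilon')}.$$

The third probability term is 0 if we choose an $\varepsilon'$ small enough such that $n(D_{ij} - 2\varepsilon') > \log(L(M-1))$ for all $n \ge \tilde{n}$. Indeed, since $D_{ij} \ge D_i$, for $n \ge \tilde{n}$ we have $n(D_{ij} - 2\varepsilon') \ge (1+\varepsilon)(1 - 2\varepsilon'/D_i) \log(L(M-1))$, so any $\varepsilon'$ satisfying $0 < \varepsilon' < \frac{\varepsilon D_i}{2(1+\varepsilon)}$ suffices. So fix any such $\varepsilon'$.

We now proceed to show that the second term also decays exponentially to zero. Let $T_i$ be as defined in (47). For a suitably chosen $\varepsilon''$, and we will soon indicate how to choose it, we have

$$\begin{aligned}
P_i\left( \sum_{k=1}^{n} \left( E_i\left[ \Delta Z_{ij}(k) \mid \mathcal{F}_{k-1} \right] - D_{ij} + \varepsilon' \right) < 0 \right)
&\le P_i\left( \sum_{k=1}^{n} \left( E_i\left[ \Delta Z_{ij}(k) \mid \mathcal{F}_{k-1} \right] - D_{ij} + \varepsilon' \right) < 0,\ T_i \le n\varepsilon'' \right) \\
&\quad + P_i\left( T_i > n\varepsilon'' \right).
\end{aligned}$$

From Lemma 14 the second probability term on the right-hand side decays exponentially with $n$. To show that the first probability term on the right-hand side decays exponentially with $n$, we use a technique of Nitinawarat and Veeravalli [14, (6.23)].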

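First, we indicate how to choose $\varepsilon''$. Define

$$\begin{aligned}
c &:= \min_{a \in \mathcal{A}} E_i\left[ \Delta Z_{ij}(k) \mid A_k = a \right] - D_{ij} \\
&= \min_{a \in \mathcal{A}} D(q_i^a \| q_j^a) - D_{ij}.
\end{aligned}$$

Since $D_{ij}$ is the $\lambda_i$-weighted average of $D(q_i^a \| q_j^a)$, we have $c \le 0$. Choose $\varepsilon''$ small enough so that $\tilde{\varepsilon} := \varepsilon' + \varepsilon'' c > 0$. We then have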


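$$\begin{aligned}
&P_i\left( \sum_{k=1}^{n} \left( E_i\left[ \Delta Z_{ij}(k) \mid \mathcal{F}_{k-1} \right] - D_{ij} + \varepsilon' \right) < 0,\ T_i \le n\varepsilon'' \right) \\
&\quad = P_i\Bigg( \sum_{k=1}^{\lfloor n\varepsilon'' \rfloor} \left( E_i\left[ \Delta Z_{ij}(k) \mid \mathcal{F}_{k-1} \right] - D_{ij} + \varepsilon' \right) \\
&\qquad\qquad + \sum_{k=\lfloor n\varepsilon'' \rfloor + 1}^{n} \left( E_i\left[ \Delta Z_{ij}(k) \mid \mathcal{F}_{k-1} \right] - D_{ij} + \varepsilon' \right) < 0,\ T_i \le n\varepsilon'' \Bigg) \\
&\quad \le P_i\Bigg( \lfloor n\varepsilon'' \rfloor (c + \varepsilon') + \sum_{k=\lfloor n\varepsilon'' \rfloor + 1}^{n} \left( E_i\left[ \Delta Z_{ij}(k) \mid \mathcal{F}_{k-1} \right] - D_{ij} + \varepsilon' \right) < 0,\ T_i \le n\varepsilon'' \Bigg) \\
&\quad \le P_i\Bigg( \sum_{k=\lfloor n\varepsilon'' \rfloor + 1}^{n} \left( E_i\left[ \Delta Z_{ij}(k) \mid \mathcal{F}_{k-1} \right] - D_{ij} + \tilde{\varepsilon} \right) < 0,\ T_i \le n\varepsilon'' \Bigg) \\
&\quad \le \tilde{P}_i\Bigg( \sum_{k=\lfloor n\varepsilon'' \rfloor + 1}^{n} \left( E_i\left[ \Delta Z_{ij}(k) \mid \mathcal{F}_{k-1} \right] - D_{ij} + \tilde{\varepsilon} \right) < 0 \Bigg) \\
&\quad \le C e^{-n b(\tilde{\varepsilon})} && (54)
\end{aligned}$$

for some $C > 0$ and some $b(\tilde{\varepsilon}) > 0$. The second inequality follows from the fact that $c \le E_i\left[ \Delta Z_{ij}(k) \mid \mathcal{F}_{k-1} \right] - D_{ij}$ for all $k$. The third inequality follows from the choice of $\tilde{\varepsilon}$ and the fact that

$$\lfloor n\varepsilon'' \rfloor (c + \varepsilon') + (n - \lfloor n\varepsilon'' \rfloor) \varepsilon' \ge (n - \lfloor n\varepsilon'' \rfloor) \tilde{\varepsilon}.$$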

$\tilde{P}_i$ is a new measure under which actions are taken according to Sluggish Procedure A but assuming $\theta(n) = i$ $\forall n$, and the observations are conditionally independent of past observations and actions, given the current action. Consequently, under $\tilde{P}_i$, the action process is a stationary Markov chain with transition probability matrix $TP(i)$. By the ergodic theorem and concentration inequalities for Markov chains, this term also decays exponentially with $n$, which is (54). ■

Appendix B

Proof of Proposition

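We will focus only on $i \le W$ and will determine

$$D_i = \max_{\lambda \in \mathcal{P}(\mathcal{A})} \min_{j \ne i} \sum_{a \in \mathcal{A}} \lambda(a) D(q_i^a \| q_j^a). \tag{55}$$

The case $i > W$ can be handled similarly and is omitted. Using (27), we can simplify the minimisation in (55) by considering three regions for $j$ as follows:

$$\begin{aligned}
D_i &= \max_{\lambda \in \mathcal{P}(\mathcal{A})} \min\Big\{ \min_{j \le W,\, j \ne i} \left( \lambda(i) D(f_k \| f_l) + \lambda(j) D(f_l \| f_k) \right), \\
&\qquad\qquad\qquad \lambda(i) D(f_k \| f_l) + (1 - \lambda(i)) D(f_l \| f_k), \\
&\qquad\qquad\qquad \min_{j > W,\, j \ne i + W} \left( 1 - \lambda(i) - \lambda(j - W) \right) D(f_l \| f_k) \Big\} && (56) \\
&= \max_{\lambda \in \mathcal{P}(\mathcal{A})} \min\Big\{ \lambda(i) D(f_k \| f_l) + \min_{j \ne i} \lambda(j) D(f_l \| f_k), \\
&\qquad\qquad\qquad \lambda(i) D(f_k \| f_l) + (1 - \lambda(i)) D(f_l \| f_k), \\
&\qquad\qquad\qquad \left( 1 - \lambda(i) - \max_{j \ne i} \lambda(j) \right) D(f_l \| f_k) \Big\}. && (57)
\end{aligned}$$

Observe that the second term is always greater than or equal to the other two terms, and hence can be removed from the minimisation. Thus,

$$D_i = \max_{\lambda \in \mathcal{P}(\mathcal{A})} \min\Big\{ \lambda(i) D(f_k \| f_l) + \min_{j \ne i} \lambda(j) D(f_l \| f_k),\ \left( 1 - \lambda(i) - \max_{j \ne i} \lambda(j) \right) D(f_l \| f_k) \Big\}. \tag{58}$$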

We now perform the maximisation over $\lambda$ in two steps. First, let us fix $\lambda(i)$ and optimise over the distribution of $1 - \lambda(i)$ among the other actions. Since

$$\min_{j \ne i} \lambda(j) \le \frac{1 - \lambda(i)}{W - 1} \le \max_{j \ne i} \lambda(j),$$

we have

$$\left( \min_{j \ne i} \lambda(j) \right) D(f_l \| f_k) \le \frac{1 - \lambda(i)}{W - 1} D(f_l \| f_k)$$

and

$$-\max_{j \ne i} \lambda(j) D(f_l \| f_k) \le -\frac{1 - \lambda(i)}{W - 1} D(f_l \| f_k).$$

Thus both the terms within braces in (58) are less than or equal to the corresponding terms for equal distribution of $1 - \lambda(i)$ among the other actions. The optimisation problem is now reduced to a single-variable optimisation of the form

$$D_i = \max_{0 \le \lambda(i) \le 1} \min\left\{ \lambda(i) D(f_k \| f_l) + \frac{1 - \lambda(i)}{W - 1} D(f_l \| f_k),\ \frac{(W - 2)(1 - \lambda(i))}{W - 1} D(f_l \| f_k) \right\}. \tag{59}$$

Second, we now perform the optimisation in (59) over $\lambda(i)$. The first term in the minimisation is increasing or non-increasing in $\lambda(i)$ depending on whether $D(f_k \| f_l) > D(f_l \| f_k)/(W - 1)$ or $D(f_k \| f_l) \le D(f_l \| f_k)/(W - 1)$, respectively. The second term is decreasing in $\lambda(i)$.

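1) Suppose $D(f_k \| f_l) > D(f_l \| f_k)/(W - 1)$. Then the two terms, viewed as linear functions of $\lambda(i)$, cross each other, and so the maximum will be achieved at the point of equality, i.e.,

$$\lambda(i) D(f_k \| f_l) + \frac{1 - \lambda(i)}{W - 1} D(f_l \| f_k) = \frac{(W - 2)(1 - \lambda(i))}{W - 1} D(f_l \| f_k).$$

Solving for $\lambda(i)$ yields

$$\begin{aligned}
\lambda(i) &= \frac{(W - 3) D(f_l \| f_k)}{(W - 1) D(f_k \| f_l) + (W - 3) D(f_l \| f_k)}, \\
\lambda(j) &= \frac{D(f_k \| f_l)}{(W - 1) D(f_k \| f_l) + (W - 3) D(f_l \| f_k)}, \quad \forall j \ne i, \\
D_i &= \frac{(W - 2) D(f_k \| f_l) D(f_l \| f_k)}{(W - 1) D(f_k \| f_l) + (W - 3) D(f_l \| f_k)}.
\end{aligned}$$

2) Suppose $D(f_k \| f_l) \le D(f_l \| f_k)/(W - 1)$. Then the maximum is achieved at $\lambda(i) = 0$. Then $\lambda(j) = 1/(W - 1)$, $\forall j \ne i$, and

$$D_i = \min\left\{ \frac{D(f_l \| f_k)}{W - 1},\ \frac{(W - 2) D(f_l \| f_k)}{W - 1} \right\} = \frac{D(f_l \| f_k)}{W - 1},$$

since $W \ge 3$. ■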
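The two cases can be cross-checked numerically. The sketch below is illustrative only: the function names and the numerical values of the two relative entropies are hypothetical, with `d_kl` standing for $D(f_k \| f_l)$ and `d_lk` for $D(f_l \| f_k)$. It compares the closed-form $D_i$ against a brute-force grid search over $\lambda(i)$ in the single-variable problem (59).

```python
def oddball_optimum(d_kl, d_lk, w):
    """Closed form from cases 1) and 2): returns (lambda_i, lambda_j, D_i)."""
    if d_kl > d_lk / (w - 1):
        denom = (w - 1) * d_kl + (w - 3) * d_lk
        return ((w - 3) * d_lk / denom,
                d_kl / denom,
                (w - 2) * d_kl * d_lk / denom)
    return 0.0, 1.0 / (w - 1), d_lk / (w - 1)


def objective(lam_i, d_kl, d_lk, w):
    """Inner minimum of (59) for a given lambda(i)."""
    rest = (1.0 - lam_i) / (w - 1)  # equal share of each other action
    return min(lam_i * d_kl + rest * d_lk, (w - 2) * rest * d_lk)


def grid_maximum(d_kl, d_lk, w, steps=100000):
    """Brute-force maximisation of (59) over a uniform grid on [0, 1]."""
    return max(objective(t / steps, d_kl, d_lk, w) for t in range(steps + 1))


# Hypothetical values: W = 6 locations, one instance of each case.
for d_kl, d_lk in [(0.8, 0.5), (0.05, 0.5)]:
    lam_i, lam_j, d_i = oddball_optimum(d_kl, d_lk, 6)
    print(lam_i, lam_j, d_i, grid_maximum(d_kl, d_lk, 6))
```

In both cases the grid maximum should agree with the closed-form value up to the grid resolution, and $\lambda(i) + (W-1)\lambda(j) = 1$.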


Appendix C 

Estimation of Relative Entropy Rate 

The computation of our proposed neuronal index requires a computation of the relative entropy rate between two Poisson point processes from estimates of their rates. The relative entropy rate between two Poisson point processes with rates $R_1$ and $R_2$ is

$$\begin{aligned}
\frac{1}{T} D(\mu_{R_1,T} \| \mu_{R_2,T}) &= R_1 \log\left( \frac{R_1}{R_2} \right) + R_2 - R_1 \\
&= R_1 \log R_1 - R_1 \log R_2 + R_2 - R_1. && (60)
\end{aligned}$$

Let $N_i(k,T)$ be the number of spikes observed in time slot $k$, $1 \le k \le n$, of duration $T$ on process $i$, $i = 1, 2$. The empirical firing rate is then $\hat{R}_i = \frac{1}{nT} \sum_{k=1}^{n} N_i(k,T)$. A natural estimate for (60), based on the observations, would be to substitute $R_i$, $i = 1, 2$, by their respective empirical estimates $\hat{R}_i$, $i = 1, 2$, to get

$$\begin{aligned}
\hat{D} &= \hat{R}_1 \log\left( \frac{\hat{R}_1}{\hat{R}_2} \right) + \hat{R}_2 - \hat{R}_1 \\
&= \hat{R}_1 \log \hat{R}_1 - \hat{R}_1 \log \hat{R}_2 + \hat{R}_2 - \hat{R}_1. && (61)
\end{aligned}$$

A little reflection suggests that this is a bad estimate, for there is a positive probability that $\hat{R}_1 > \hat{R}_2 = 0$, yielding $E_{R_1,R_2}[\hat{D}] = \infty$. Estimate (61) is thus biased (though consistent). Our approach is to obtain estimates for each of the terms in (60) with minimal bias.
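The closed form in (60) and the failure mode of the plug-in estimate (61) can be illustrated with a short sketch. It rests on two assumptions that are not taken from the text: that the divergence over a window of duration $T$ may be computed from the spike counts alone (the count being a sufficient statistic for a homogeneous Poisson process), and the convention $0 \log 0 = 0$. All function names are hypothetical.

```python
import math


def poisson_relative_entropy_rate(r1, r2):
    """Right-hand side of (60): R1 log(R1 / R2) + R2 - R1."""
    return r1 * math.log(r1 / r2) + r2 - r1


def poisson_count_divergence(r1, r2, duration, kmax=500):
    """D(Poisson(r1 * T) || Poisson(r2 * T)), summed directly over counts."""
    m1, m2 = r1 * duration, r2 * duration
    total = 0.0
    pmf = math.exp(-m1)  # P(count = 0) under the first process
    for k in range(kmax + 1):
        # log-likelihood ratio of a count k is k log(m1 / m2) - (m1 - m2)
        total += pmf * (k * math.log(m1 / m2) - (m1 - m2))
        pmf *= m1 / (k + 1)  # P(count = k + 1) from P(count = k)
    return total


def plug_in_estimate(r1_hat, r2_hat):
    """Estimate (61); infinite when r1_hat > 0 and r2_hat == 0."""
    if r1_hat == 0.0:
        return r2_hat  # uses the convention 0 log 0 = 0
    if r2_hat == 0.0:
        return math.inf
    return poisson_relative_entropy_rate(r1_hat, r2_hat)


print(poisson_count_divergence(4.0, 2.5, 0.25) / 0.25,
      poisson_relative_entropy_rate(4.0, 2.5))
print(plug_in_estimate(1.0, 0.0))
```

The first line should show the count-based divergence per unit time matching (60), and the second shows the plug-in estimate becoming infinite when no spike is observed on the second process.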

Unbiased and maximum likelihood estimates for the third and fourth terms on the right-hand side of (60) are the respective empirical firing rates themselves. Let us therefore now study the second term. We may assume that the firings are independent, given $R_1$ and $R_2$. Thus we may look for an estimator of the form $-\hat{R}_1 f(\hat{R}_2)$, which has expectation $-R_1 E_{R_2}[f(\hat{R}_2)]$. For this to be close to the desired $-R_1 \log R_2$, we look for a function $f$ such that $E_{R_2} f(\hat{R}_2) \approx \log R_2$. The difficulty is due to the $\log(0) = -\infty$ artifact. We consider a simple fix of adding a nonzero offset to the empirical estimate, i.e., we consider estimates of the form $\log(\hat{R} + \theta)$. Figure 6 shows the optimum offset $\theta^*(R)$ for different firing rates $R$ when $n = T = 1$. The optimum offset $\theta^*(R)$ can be seen to converge to 0.5 for large $R$. Further, the convergence is quite fast: $\theta^*(R)$ is close to 0.5 for all $R$ greater than 3. Hence in this work we use $\theta = 0.5$ as the offset, thus resulting in an estimator for $\log(R)$ of the form $\log(\hat{R} + 1/2)$. For a general $n$ and $T$ we then have $E \log(nT\hat{R} + 1/2) \approx \log(nTR)$, which in turn implies $E \log\left( \hat{R} + \frac{1}{2nT} \right) \approx \log(R)$. Thus an estimator for a general $n$ and $T$ would be $\log\left( \hat{R} + \frac{1}{2nT} \right)$. The estimator for the second term in (60) is then $-\hat{R}_1 \log\left( \hat{R}_2 + \frac{1}{2nT} \right)$. One could look for better estimators with the offset being a function of the observed empirical means. In this work we stick to the constant offset estimator, with the constant offset being $\theta = 0.5$, as it is reasonable to assume that the neurons have a firing rate greater than $3/(nT) = 3/(24 \times 0.25) = 0.5$ spikes/second ($n = 24$, $T = 250$ ms), thus putting them in the firing rate regime where $\theta = 0.5$ is a good offset for near unbiasedness. The values $T = 250$ ms and $n = 24$ correspond to the neuronal recording time and the number of repetitions in the neuronal recording experiment of Sripati and Olson [1].

Fig. 6. Optimum offset ($\theta^*(R)$) to minimise bias for different firing rates ($R$).

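The effect of the offset can be examined numerically. The following sketch, for $n = T = 1$ so that the empirical rate is a single Poisson count, evaluates $E \log(N + \theta)$ by direct summation and reports the bias relative to $\log R$. The function names and the truncation point `kmax` are assumptions made for this illustration and are not part of the original analysis.

```python
import math


def expected_log_with_offset(rate, offset, kmax=400):
    """E[log(N + offset)] for N ~ Poisson(rate), by direct summation."""
    total = 0.0
    pmf = math.exp(-rate)  # P(N = 0)
    for k in range(kmax + 1):
        total += pmf * math.log(k + offset)
        pmf *= rate / (k + 1)  # P(N = k + 1) from P(N = k)
    return total


def log_bias(rate, offset=0.5):
    """Bias of log(N + offset) as an estimate of log(rate)."""
    return expected_log_with_offset(rate, offset) - math.log(rate)


# Compare the constant offset 0.5 with a larger offset at a few rates.
for rate in (3.0, 5.0, 10.0):
    print(rate, log_bias(rate, 0.5), log_bias(rate, 1.0))
```

For rates above about 3, the bias with offset 0.5 should come out much smaller in magnitude than with offset 1, in line with the behaviour described for Fig. 6.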

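To address the first term of (60) we consider estimates of the form $\hat{R}_1 g(\hat{R}_1)$ such that

$$E_{R_1}\left[ \hat{R}_1 g(\hat{R}_1) \right] = R_1 \log R_1.$$

Expanding the expectation above for $n = T = 1$, we obtain

$$\begin{aligned}
E_{R_1}\left[ \hat{R}_1 g(\hat{R}_1) \right] &= \sum_{k=0}^{\infty} k \, g(k) \, \frac{e^{-R_1} R_1^k}{k!} \\
&= R_1 \sum_{k=1}^{\infty} g(k) \, \frac{e^{-R_1} R_1^{k-1}}{(k-1)!} \\
&= R_1 E_{R_1}\left[ g(\hat{R}_1 + 1) \right].
\end{aligned}$$

Thus we want a $g$ such that $E_{R_1}\left[ g(\hat{R}_1 + 1) \right] = \log R_1$. From the discussion on the second term, we know that $E_{R_1} \log(\hat{R}_1 + \tfrac{1}{2}) \approx \log R_1$, and hence a good choice for $g$ would be $g(\hat{R}_1) = \log(\hat{R}_1 - \tfrac{1}{2})$. Thus our estimate for the first term for a general $n$ and $T$ is

$$\begin{cases}
\hat{R}_1 \log\left( \hat{R}_1 - \frac{1}{2nT} \right) & \text{if } \hat{R}_1 > \frac{1}{2nT}, \text{ i.e., there is at least one point}, \\
0 & \text{otherwise}.
\end{cases}$$

Therefore our combined estimate for the relative entropy rate in (60), based on the average firing rate estimates $\hat{R}_1$ and $\hat{R}_2$ and obtained over a time of duration $nT$, is

$$\hat{D}(\hat{R}_1 \| \hat{R}_2) = \begin{cases}
\hat{R}_1 \log\left( \hat{R}_1 - \frac{1}{2nT} \right) - \hat{R}_1 \log\left( \hat{R}_2 + \frac{1}{2nT} \right) + \hat{R}_2 - \hat{R}_1 & \text{if } \hat{R}_1 > \frac{1}{2nT}, \\
\hat{R}_2 & \text{otherwise}.
\end{cases}$$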
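The combined estimator can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the value returned in the "otherwise" branch (no spike on the first process) is an assumption, taken here to be $\hat{R}_2$, which is what the term-by-term construction gives when $\hat{R}_1 = 0$.

```python
import math


def empirical_rate(counts, slot_duration):
    """Empirical firing rate: total spikes over n slots divided by n * T."""
    return sum(counts) / (len(counts) * slot_duration)


def relative_entropy_rate_estimate(r1_hat, r2_hat, n, slot_duration):
    """Offset-corrected estimate of R1 log R1 - R1 log R2 + R2 - R1."""
    half = 1.0 / (2.0 * n * slot_duration)  # the offset 1 / (2 n T)
    if r1_hat > half:  # at least one spike on the first process
        return (r1_hat * math.log(r1_hat - half)
                - r1_hat * math.log(r2_hat + half)
                + r2_hat - r1_hat)
    return r2_hat  # assumed value when the first process shows no spike


# Hypothetical spike counts over n = 24 slots of T = 0.25 s each.
counts_1 = [1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 0, 1, 1, 1, 2, 0, 1, 1, 1, 0, 2, 1, 1, 1]
counts_2 = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
r1_hat = empirical_rate(counts_1, 0.25)
r2_hat = empirical_rate(counts_2, 0.25)
print(relative_entropy_rate_estimate(r1_hat, r2_hat, 24, 0.25))
```

Unlike the plug-in form (61), this estimate remains finite when $\hat{R}_2 = 0$, because the offset keeps the argument of the second logarithm strictly positive.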


Relative entropy being a convex function of its arguments, the plug-in estimator of (61) would always have a positive bias. Naturally, an unbiased estimator would have a smaller value than (61), and our proposed estimator does satisfy this requirement.

In Figure 7 we plot the estimator bias for different $(R_1, R_2)$ pairs for $n = 24$ and $T = 250$ ms, motivated by the specific neuronal experimental data of Sripati and Olson [1]. From Figure 7 we can see that our proposed estimator has low estimation error for most $(R_1, R_2)$. Estimation error is relatively large only when $R_1$ is large and $R_2$ is close to zero.